Abstract. A basic feature of many field experiments is that investigators
are only able to randomize clusters of individuals (such as households,
communities, firms, medical practices, schools or classrooms) even when
the individual is the unit of interest. To recoup the resulting efficiency
loss, some studies pair similar clusters and randomize treatment within
pairs. However, many other studies avoid pairing, in part because of
claims in the literature, echoed by clinical trials standards
organizations, that this matched-pair, cluster-randomization design has
serious problems. We argue that all such claims are unfounded. We also
prove that the estimator recommended for this design in the literature is
unbiased only in situations when matching is unnecessary; its standard
error is also invalid. To overcome this problem without modeling
assumptions, we develop a simple design-based estimator with much improved
statistical properties. We also propose a model-based approach that
includes some of the benefits of our design-based estimator as well as the
estimator in the literature. Our methods also address individual-level
noncompliance, which is common in applications but not allowed for in most
existing methods. We show that from the perspective of bias, efficiency,
power, robustness or research costs, and in large or small samples,
pairing should be used in cluster-randomized experiments whenever
feasible; failing to do so is equivalent to discarding a considerable
fraction of one's data. We develop these techniques in the context of a
randomized evaluation we are conducting of the Mexican Universal Health
Insurance Program.

Key words and phrases: Causal inference, community intervention tri- 
als, field experiments, group-randomized trials, place-randomized trials, 
health policy, matched-pair design, noncompliance, power. 



Kosuke Imai is Assistant Professor, Department of 
Politics, Princeton University, Princeton, New Jersey 
08544, USA e-mail: kimai@princeton.edu; URL: 
http://imai.princeton.edu. Gary King is Albert J. 
Weatherhead III University Professor, Institute for 
Quantitative Social Science, Harvard University, 1737 
Cambridge St., Cambridge, Massachusetts 02138, USA 
e-mail: King@Harvard.edu; URL: 



http://GKing.Harvard.edu. Clayton Nall is Ph.D.
Candidate, Department of Government, Institute for 
Quantitative Social Science, Harvard University, 1737 
Cambridge St., Cambridge, Massachusetts 02138, USA 
e-mail: nall@fas.harvard.edu. 






K. IMAI, G. KING AND C. NALL 



1. INTRODUCTION 

For political, ethical or administrative reasons, re- 
searchers conducting field experiments are often un- 
able to randomize treatment assignment to individ- 
uals and so instead randomize treatments to clus- 
ters of individuals (Murray, 1998; Donner and Klar, 
2000a; Raudenbush, Martinez and Spybrook, 2007). 
For example, 19 (68%) of the 28 field experiments 
we found published in major political science jour- 
nals since 2000 randomized households, precincts, 
city-blocks or villages even though individual voters 
were the inferential target (e.g., Arceneaux, 2005); 
in public health and medicine, where "the number 
of trials reporting a cluster design has risen expo- 
nentially since 1997" (Campbell, 2004), randomiza- 
tion occurs at the level of health clinics, physicians 
or other administrative and geographical units even 
though individuals are the units of interest (e.g., 
Sommer et al., 1986; Varnell et al., 2004); and nu- 
merous education researchers randomize schools, class- 
rooms or teachers instead of students (e.g., Angrist 
and Lavy, 2002). 

Since efficiency drops when randomizing clusters 
of individuals instead of individuals themselves (Corn- 
field, 1978), many scholars attempt to recoup some 
of this lost efficiency by pairing clusters, based on 
the similarity of available background characteris- 
tics, before randomly assigning one cluster within 
each pair to receive the treatment assignment (e.g.,
Ball and Bogatz, 1972; Gail et al., 1992; Hill, Ru- 
bin and Thomas, 1999). Since matching prior to 
random treatment assignment can greatly improve 
the efficiency of causal effect estimation (Bloom, 
1978; Greevy et al., 2004), and matching in pairs 
can be substantially more efficient than matching in 
larger blocks, matched-pair, cluster-randomization 
(MPCR) would appear to be an attractive design 
for field experiments (Imai, King and Stuart, 2008). 
[See also Moulton (2004).] The design is especially 
useful for public policy experiments since, when used 
properly, it can be robust to interventions by politi- 
cians and others that have ruined many policy evalu- 
ations, such as when office-holders arrange program 
benefits for constituents who live in control group 
clusters (King et al., 2007).

Unfortunately, despite its apparent benefits and
common usage, this experimental design has an un- 
certain scientific status. Researchers in this area and 
formal statements from clinical trial standards orga- 
nizations (e.g., Donner and Klar, 2004; Feng et al., 
2001; Medical Research Council, 2002) claim that 
certain "analytic limitations" make MPCR, or at 
least the existing methods available to analyze data 
from this design, inappropriate. These claimed lim-
itations include "the restriction of prediction mod-
els to cluster-level baseline risk factors (e.g., clus-
ter size), the inability to test for homogeneity of
... [causal effects across clusters], and difficulties in
estimating the intracluster correlation coefficient, a
measure of similarity among cluster members" (Klar
and Donner, 1997, page 1754). In addition, in a 
widely cited article, Martin et al. (1993) claim that 
in small samples, pairing can reduce statistical power. 

We show that each of the claims regarding an- 
alytical limitations of MPCR is incorrect. We also 
demonstrate that the power calculations leading Mar-
tin et al. (1993) to recommend against MPCR in
small samples are dependent on an assumption of
equal cluster sizes that vitiates one major advan-
tage of pair matching; we show in real data that
the assumption does not apply and that, without it, pair
matching on cluster sizes and pre-treatment vari-
ables that affect the outcome improves both effi-
ciency and power a great deal, even in samples as
small as three pairs. In fact, because the efficiency
gain of MPCR depends on the correlation of clus- 
ter means weighted by cluster size, the advantage 
can be much larger than the unweighted correlations 
that have been studied seem to indicate, even when 
cluster sizes are independent of the outcome. 

Finally, there exists no published formal evalua- 
tion of the statistical properties of the estimator for 
MPCR data most commonly recommended in the 
methodological literature. By defining the quantities 
of interest separately from the methods used to es- 
timate them, and identifying a model that gives rise 
to the most commonly used estimator, we show that 
this approach depends on assumptions, such as the 
homogeneity of treatment effects across all clusters, 
that apply best when matching is not needed to be- 
gin with. The commonly used variance estimator is 
also biased. We then offer new simple design-based 
estimators and their variances. We also propose an 
alternative model-based approach that includes the 
benefits of our design-based estimator, which has
little or no bias, and the estimator in the litera-
ture, which under certain circumstances has lower
variance. Finally, we extend our methods to situa- 
tions with individual-level noncompliance, which is 
a basic feature of many MPCR experiments but for 
which most prior methods have not been adapted. 
With the results and new methods offered here, am- 
biguity about what to do in cluster randomized ex- 
periments vanishes: pair matching should be used 
whenever feasible. 

2. EVALUATION OF THE MEXICAN 
UNIVERSAL HEALTH INSURANCE 
PROGRAM 

As a running example of MPCR, we introduce a 
randomized evaluation we are conducting of Seguro 
Popular de Salud (SPS) in Mexico. A major domes- 
tic initiative of the Vicente Fox presidency, the pro- 
gram seeks "to provide social protection in health 
to the 50 million uninsured Mexicans" (Frenk et al., 
2003, page 1667), constituting about half the popu- 
lation (King et al., 2007). The government intends 
to spend an additional one percent of GDP on health 
compared to 2002 once the program is fully intro- 
duced. 

SPS permitted a cluster randomized (CR) study 
to be built into the program rollout. Under national 
legislation, Mexican states must apply to the federal 
government for funds both to publicize the program 
and fund its operations. The federal government ap- 
proves these requests only when local health clinics 
are brought up to federal standards. When an area 
is approved to begin program enrollment, families 
who affiliate are expected to receive free preventa- 
tive and regular medical care, pharmaceuticals and 
medical procedures. However, because local health 
clinics and hospitals may take years to meet federal 
standards, and also because of budget restrictions, 
a staged rollout was necessary and also allowed us 
the chance to run this randomized study. Finally, 
since SPS allows individuals to decide for themselves 
whether to enroll (if necessary, by traveling from un- 
enrolled to enrolled areas), it was possible to adopt 
a clustered encouragement design (Frangakis et al., 
2002), thereby permitting estimation of individual-
level program effects. (We focus on the intention-to-treat
(ITT) effect until Section 6.)

The MPCR design was implemented in geographic 
areas created for the project which we call "health 
clusters," defined as the geographic catchment area 



of a local hospital or clinic. The country is tiled 
by 12,824 such clusters, and negotiations with the 
Mexican government produced more than 100 for 
which random assignment was acceptable. The cho- 
sen clusters were paired based on census demograph- 
ics, poverty, education, and health infrastructure. 
Within each pair, one "treatment" cluster was ran- 
domly chosen for early program rollout, receiving 
funds to upgrade their health clinics and encour- 
age individual enrollment. The "control" cluster in 
each pair had its rollout set for some future time. 
(Individuals could still obtain SPS benefits by trav- 
eling to SPS-approved clusters, but did not receive 
encouragement or resources to do so.) For design 
details, see King et al. (2007). 

The primary outcome of interest at this stage was 
the level of out-of-pocket health expenditures, while 
secondary outcomes of interest included medical uti- 
lization, health self-assessment and self-reported 
health behaviors. Outcomes were measured in a base- 
line and followup panel survey of more than 32,000 
households. Our examples draw upon 67 of these 
variables measured at the 10-month followup. 

3. MATCHED-PAIR, CLUSTER-RANDOMIZED 
EXPERIMENTS 

We now introduce MPCR experiments, including 
the theories of inference commonly applied (Sec-
tion 3.1), the formal definitions, notation and as-
sumptions used (Section 3.2), and the quantities
of interest typically sought (Section 3.3). 

3.1 Theories of Inference 

We describe the model-based and permutation- 
based theories of statistical inference that have been 
applied to MPCR data and then the design-based 
theory from which our work is derived. 

First, model-based inference applied to MPCR typ- 
ically uses generalized mixed-effects models, gen- 
eralized estimating equations or multi-level models 
(Feng et al., 2001). Most of these work only if the 
modeling assumptions are correct; they also rely on 
asymptotic approximations. Model-based and model- 
assisted approaches have proved to be powerful in 
other areas, especially in survey research and miss- 
ing data where it is often necessary, but they violate 
the purpose and spirit of experimental work which 
goes to great lengths and expense to avoid these 
types of assumptions.

Fisher's (1935) permutation-based theory of infer-
ence, which constructs exact nonparametric hypoth-
esis tests based only on the random treatment as- 
signment, has also been applied to MPCR. Although 
permutation inference in principle requires no mod- 
els or approximations, in practice applications typ- 
ically have required additional assumptions such as 
constant treatment effects across clusters or some 
kind of (e.g., Monte Carlo, large sample) approxi- 
mations. The existing applications include Gail et 
al. (1996) and Braun and Feng (2001), which com- 
bine permutation inference with parametric model- 
ing, and Small, Ten Have and Rosenbaum (2008) 
which considers quantile effects using different and 
more modest assumptions. 

In contrast, we use Neyman's (1923) theory of in- 
ference, which is well known but has not before been 
attempted for MPCR. Like Fisher's permutation- 
based theory, Neyman's approach is also design (or 
"randomization") based and nonparametric, but it 
naturally avoids the constant treatment effect as- 
sumption and can provide valid inferences about 
both sample and population average treatment ef- 
fects without modeling assumptions (Rubin, 1991). 
The estimators we derive are also simple to under-
stand and easier to compute (requiring only weighted
means and no numerical optimization or simula-
tion).

3.2 Formal Design Definition, Notation and 
Assumptions 

Consider an MPCR experiment where $2m$ clusters are paired, based on a
known function of the cluster characteristics, prior to the
randomization of a binary treatment. We assume the $j$th cluster in the
$k$th pair contains $n_{jk}$ units, where $j = 1, 2$ and
$k = 1, \ldots, m$, and thus the total number of units is equal to
$n = \sum_{k=1}^{m} (n_{1k} + n_{2k})$.

Under MPCR, simple randomization of an indicator variable $Z_k$, for
$k = 1, 2, \ldots, m$, is conducted independently across the $m$ pairs.
For a pair with $Z_k = 1$, the first cluster within pair $k$ is treated
(in our case, assigned encouragement to affiliate with SPS), and the
second cluster is assigned control. In contrast, for a pair with
$Z_k = 0$, the first cluster is the control whereas the second is
treated. Thus, using $T_{jk}$ for the treatment indicator for the $j$th
cluster in the $k$th pair, $T_{1k} = Z_k$ and $T_{2k} = 1 - Z_k$. In the
context of the SPS evaluation, we consider an intention-to-treat (ITT)
analysis to estimate the causal effects of encouragement to affiliate
with the program (see Section 6 on the estimation of causal effects of
the actual affiliation).
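
The assignment mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assign_pairs`, the seed, and the number of pairs are hypothetical and are not taken from the SPS evaluation.

```python
import random

def assign_pairs(m, seed=None):
    """Draw Z_k with Pr(Z_k = 1) = 1/2, independently for each of m pairs.

    Returns a list of (Z_k, T_1k, T_2k) tuples with T_1k = Z_k and
    T_2k = 1 - Z_k, so exactly one cluster in every pair is treated.
    """
    rng = random.Random(seed)
    assignments = []
    for _ in range(m):
        z = rng.randint(0, 1)            # fair coin flip for pair k
        assignments.append((z, z, 1 - z))
    return assignments

# Hypothetical example with m = 5 pairs.
pairs = assign_pairs(m=5, seed=0)
for k, (z, t1, t2) in enumerate(pairs, start=1):
    print(f"pair {k}: Z_k={z}, T_1k={t1}, T_2k={t2}")
```

Because each pair receives its own independent coin flip, every pair always contributes one treated and one control cluster, whatever the realized values of $Z_k$.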

We denote $Y_{ijk}(T_{jk})$ as the potential outcomes under the
treatment ($T_{jk} = 1$) and control ($T_{jk} = 0$) conditions for the
$i$th unit in the $j$th cluster of the $k$th pair (Holland, 1986;
Maldonado and Greenland, 2002). The observed outcome variable is
$Y_{ijk} = T_{jk} Y_{ijk}(1) + (1 - T_{jk}) Y_{ijk}(0)$. Finally, the
order of clusters within each pair is randomized so that the population
distribution of $(Y_{i1k}(1), Y_{i1k}(0))$ equals that of
$(Y_{i2k}(1), Y_{i2k}(0))$ (though this equality may not hold in
sample).

A defining feature of CR experiments is that the 
potential outcomes for the ith unit in the jth clus- 
ter of the kth pair are a function of the cluster-level 
randomized treatment variable, Tjk, rather than its 
unit-level treatment counterpart. Similarly, the unit-
level causal effect, $Y_{ijk}(1) - Y_{ijk}(0)$, is the differ-
ence between two unit-level potential outcomes that
are the functions of the cluster-level treatment vari- 
able. Thus, in CR experiments, the usual assump- 
tion of no interference (Cox, 1958; Rubin, 1990) ap- 
plies only at the cluster level. Moreover, in MPCR, 
assuming no interference only between pairs of clus- 
ters is sufficient. This advantage of MPCR designs 
can be substantial if contagion or social influence is 
present at the individual level, where, for example, 
individuals may affect the behavior of neighbors or 
friends, but such interference does not exist across 
clusters or pairs of clusters. Thus, we only assume: 

Assumption 1 (No interference between matched-pairs). Let
$Y_{ijk}(\mathbf{T})$ be the potential outcomes for the $i$th unit in
the $j$th cluster of the $k$th matched-pair, where $\mathbf{T}$ is an
$(m \times 2)$ matrix whose $(j, k)$ element is $T_{jk}$. We assume that
if $T_{jk} = T'_{jk}$, then
$Y_{ijk}(\mathbf{T}) = Y_{ijk}(\mathbf{T}')$.

The assumption allows us to write $Y_{ijk}(T_{jk})$ rather than
$Y_{ijk}(\mathbf{T})$. Since $T_{1k} = Z_k$ and $T_{2k} = 1 - Z_k$,
$Y_{ijk}(T_{jk})$ only depends on $Z_k$. Given that the assumption of
no interference among individuals is often highly unrealistic (Sobel,
2006), MPCR offers an attractive alternative. In the Mexico experiment,
Assumption 1 is reasonable because most of the clusters in our
experiment are noncontiguous and the travel times between them are
substantial. However, especially in small villages, individual-level no
interference assumptions would have been implausible.

Finally, we formalize the cluster-level randomized 
treatment assignment as follows.

Assumption 2 (Cluster randomization under matched-pair design). The
potential outcomes are independent of the randomization indicator
variable: $(Y_{ijk}(1), Y_{ijk}(0)) \perp\!\!\!\perp Z_k$, for all $i$,
$j$ and $k$. Also, $Z_k$ is independent across matched-pairs, and
$\Pr(Z_k = 1) = 1/2$ for all $k$.

The assumption also implies
$(Y_{ijk}(1), Y_{ijk}(0)) \perp\!\!\!\perp T_{jk}$ since $T_{jk}$ is a
function of $Z_k$.

3.3 Quantities of Interest 

We now offer the definitions of the causal effects of 
interest under MPCR (or CR in general) which have 
not been formally defined in the literature. At least 
two types of each of four distinct quantities may be 
of interest in these experiments. We begin with the 
four quantities, which define the target population, 
and then discuss the two types, which clarify the 
role of interference. Section 6 introduces additional 
quantities of interest when individual-level noncom- 
pliance exists. (All the quantities below are based 
on causal effects defined as grouped individual-level 
phenomena; we discuss cluster-level causal quanti- 
ties in Section 4.6.) 

3.3.1 Target population quantities. Table 1 offers 
an overview of the four target population causal ef- 
fects. All four quantities represent the causal treat- 
ment effect (the potential outcome under treatment 
minus the potential outcome under control) aver- 
aged over different sets of units. 

First is the sample average treatment effect (SATE or $\psi_S$), which
is an average over the set of all units in the observed sample (which we
denote as $\mathcal{S}$):

$$
\psi_S \equiv E_{\mathcal{S}}(Y(1) - Y(0))
  = \frac{1}{n} \sum_{k=1}^{m} \sum_{j=1}^{2} \sum_{i=1}^{n_{jk}}
    (Y_{ijk}(1) - Y_{ijk}(0)), \tag{1}
$$

where the sums go over pairs, the two clusters within each pair and the
units within each cluster.

The second quantity treats observed clusters as fixed (and not
necessarily representative of some population) and the units within
clusters as randomly sampled from the finite population of units within
each cluster. This gives the cluster average treatment effect (CATE or
$\psi_C$):

$$
\psi_C \equiv E_{\mathcal{C}}(Y(1) - Y(0))
  = \frac{1}{N} \sum_{k=1}^{m} \sum_{j=1}^{2} \sum_{i=1}^{N_{jk}}
    (Y_{ijk}(1) - Y_{ijk}(0)), \tag{2}
$$

where the expectation is taken over the set $\mathcal{C}$ which contains
all observed units within the sample clusters, $N_{jk}$ is the known
(and finite) population cluster size, and
$N = \sum_{k=1}^{m} (N_{1k} + N_{2k})$. Throughout, we assume simple
random sampling within each cluster for simplicity, but other random
sampling procedures can easily be accommodated via unit-level weights.
Thus, the only difference between SATE and CATE is whether each unit
within clusters is treated as fixed or randomly drawn based on a known
sampling mechanism.

Table 1
Quantities of Interest: For each causal effect, this table lists
whether clusters and units within clusters are treated as observed and
fixed or instead as a sample from a larger population. The resulting
inferential target is also given

Quantities        Clusters   Units within clusters   Inferential target
----------------  ---------  ----------------------  ---------------------------------------------------
$\psi_S$  SATE    Observed   Observed                Observed sample
$\psi_C$  CATE    Observed   Sampled                 Population within observed clusters
$\psi_U$  UATE    Sampled    Observed                Observable units within the population of clusters
$\psi_P$  PATE    Sampled    Sampled                 Population

A third quantity treats the clusters as randomly sampled from a larger
population, but the units within the sampled clusters are treated as
fixed. The inferential target is the set $\mathcal{U}$, which includes
all units in the population of clusters that would be observed if its
cluster were in the observed sample. This is what we call the unit
average treatment effect (UATE or $\psi_U$) and is defined as
$\psi_U \equiv E_{\mathcal{U}}(Y(1) - Y(0))$.

The final quantity of interest is the population average treatment
effect (PATE or $\psi_P$), which is defined as
$\psi_P \equiv E_{\mathcal{P}}(Y(1) - Y(0))$, where the expectation is
taken over the entire population $\mathcal{P}$, that is, the population
of units within the population of clusters. For simplicity throughout,
we assume an infinite population of clusters, but this is easily
extended to finite populations at some cost in additional notation.
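
To make the distinction between the sample-based and population-based targets concrete, the following Python sketch computes SATE and CATE on an invented data set. Everything here is an assumption for illustration: the cluster sizes, the simulated potential outcomes, and the helper `average_effect` are hypothetical, and in real data only one potential outcome per unit is ever observed, so neither quantity could be computed directly in this way.

```python
import random

rng = random.Random(1)

# Hypothetical finite populations: keys are (j, k) = (cluster, pair);
# each unit is a (Y(1), Y(0)) tuple of potential outcomes.
sizes = {(1, 1): 40, (2, 1): 60, (1, 2): 30, (2, 2): 70}   # N_jk
population = {
    key: [(rng.gauss(1.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(size)]
    for key, size in sizes.items()
}

# Simple random sample of n_jk = 10 units within each observed cluster.
sample = {key: rng.sample(units, 10) for key, units in population.items()}

def average_effect(clusters):
    """Pool all units in the given clusters and average Y(1) - Y(0)."""
    effects = [y1 - y0 for units in clusters.values() for (y1, y0) in units]
    return sum(effects) / len(effects)

sate = average_effect(sample)       # averages over the n sampled units
cate = average_effect(population)   # averages over all N units in the observed clusters
print(f"SATE = {sate:.3f}, CATE = {cate:.3f}")
```

The two numbers differ only because the sampled units are a random subset of each cluster's population; UATE and PATE would additionally treat the four clusters themselves as draws from a larger population of clusters.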

Researchers should design their experiments to make inferences to their
desired quantity of interest, though in practice they may choose to
estimate other quantities of interest when they face design limitations.
In the SPS evaluation, for example, we would like to infer PATE for all
of Mexico, but our health clusters were not (and due to political and
administrative constraints could not be) randomly selected. This means
that, like most medical experiments, any method applied to our data to
estimate PATE will be dependent on assumptions about the selection
process. An alternative approach would be to try to estimate one of the
other quantities. CATE or SATE are straightforward possibilities, and
CATE is probably most apt in this case, since individuals within
clusters were randomly selected, and both quantities condition on the
clusters we observe. Of course, even when inferences are made to
restricted populations, readers may still extrapolate to a different
population of interest, and so the researcher needs to decide on the
appropriate presentation strategy. From a public policy perspective,
UATE may be a reasonable target quantity, where we try to infer to the
individuals who would be sampled in all the health clusters in Mexico
that are similar to our observed clusters, and from which our clusters
could plausibly have been randomly drawn.

3.3.2 Interference. Inference in CR experiments
may be affected by three different types of interfer- 
ence, each of which may require different assump- 
tions. First, when interference exists among individ- 
uals within a cluster, the potential outcomes of one 
person (or unit) within a cluster may be different de- 
pending on other units' treatment assignment. This 
type of interference is expected and no assumptions 
are required for the four causal quantities of inter-
est. In CR experiments, within-cluster interference
is part of the outcome, and researchers can estimate 
the causal effects of cluster-level treatment on unit- 
level outcome. Understanding the effect of individu- 
als independent of and isolated from other individ- 
uals in the same cluster is best left to studies where 
individual randomization is possible. 

Second, interference between clusters in different 
pairs may affect outcomes. Assumption 1 requires 
the absence of such interference between clusters 
in different pairs. We continue to maintain this as- 
sumption, as Sobel (2006) demonstrates that with- 
out it even the definition of a causal effect is com- 
plicated (see also Rosenbaum, 2007). 

Third, interference between treatment and control clusters in the same
pair requires us to redefine causal effects to account for interference.
For example, if one cluster is assigned SPS, individuals in the other
(control) cluster within the pair may become envious or depressed as a
consequence. This type of interference within a pair can be dealt with
in two ways. In the first, which we call no-interference, we define the
causal effect (SATE, CATE, UATE or PATE) so that the treatment in one
cluster has no effect on the potential outcomes of units in the control
cluster. In the second, which we call with-interference, the causal
effect is defined so that it includes interference between clusters
within pairs as well as interference between units within each cluster.
(For our Mexico experiment, we do not expect much direct interference
within or across pairs, although nearby clusters outside our experiment
might exert some influence over those we observe, in which case the
definition of UATE or PATE might change.)

Estimating the no-interference version of SATE, CATE, UATE or PATE in
the presence of interference is feasible only with assumption-laden
estimators. In contrast, estimating the with-interference version is
easier since it accepts whatever level of non-interference one's data
happens to present. Of course, having a quantity that is easy to
estimate is not a satisfactory substitute for having an estimate of the
quantity of interest. The best way to avoid this problem is to use these
facts to design better experiments. For example, we can select
noncontiguous clusters to pair, and pairs that are not contiguous to
other pairs. Following rules like this whenever feasible reduces the
difference between the no-interference and with-interference quantities.

4. ESTIMATORS 

We now define our estimators and derive their 
statistical properties. Our strategy throughout is to 
make as few assumptions as feasible beyond the ex-
perimental design. We also briefly discuss an ap-
proach that has been offered in the literature. Since
our approach has little or no bias, and the exist- 
ing estimator is biased but may have low variance 
in some circumstances, we also offer a model-based 
method that combines some of the benefits of both 
approaches. 

4.1 Definitions 

The point estimators for the with-interference version of the four
quantities of interest are each weighted averages of within-pair mean
differences between the treated and control clusters, but with different
weights. We thus define

$$
\hat\psi(w_k) \equiv \frac{1}{\sum_{k=1}^{m} w_k} \sum_{k=1}^{m} w_k
  \left\{ Z_k \left( \frac{\sum_{i=1}^{n_{1k}} Y_{i1k}}{n_{1k}}
                   - \frac{\sum_{i=1}^{n_{2k}} Y_{i2k}}{n_{2k}} \right)
        + (1 - Z_k) \left( \frac{\sum_{i=1}^{n_{2k}} Y_{i2k}}{n_{2k}}
                   - \frac{\sum_{i=1}^{n_{1k}} Y_{i1k}}{n_{1k}} \right)
  \right\}, \tag{3}
$$

where the weight for the $k$th pair of clusters, denoted by $w_k$,
defines a specific estimator.

The estimator most commonly recommended in the methodological literature
is based on a weight using the harmonic mean of sample cluster sizes,
which can be written as $\hat\psi(n_{1k} n_{2k} / (n_{1k} + n_{2k}))$
(see, e.g., Donner, 1987; Donner and Donald, 1987; Donner and Klar,
1993; Hayes and Bennett, 1999; Bloom, 2006; Raudenbush, 1997; Turner,
White and Croudace, 2007). This estimator, and its variance estimator,
are in general biased (see Appendix A.4), but may have low variance in
some situations, an issue we return to in Section 4.5.

As shown in Table 2, $\hat\psi(n_{1k} + n_{2k})$ is our point estimator
for both SATE and UATE, whereas $\hat\psi(N_{1k} + N_{2k})$ applies to
both CATE and PATE. This is intuitive, as SATE and UATE are based on
those units (which would be) sampled in a cluster whereas CATE and PATE
are based on the population of units within clusters. Our estimator for
SATE and UATE differs from the existing estimator based on harmonic mean
weights unless the sample cluster sizes within each matched pair are
equal ($n_{1k} = n_{2k}$ for all $k = 1, \ldots, m$), which rarely
occurs at least in field experiments.

Table 2 also summarizes the variances and their estimators. Under our
design-based inference, UATE and PATE have identifiable variances, the
exact expression for which we give below. SATE and CATE have
unidentifiable variances, and so we offer their upper bound, leading to
a conservative confidence interval. Our variance estimators differ from
the existing estimator even when sample cluster sizes are matched
exactly. Our variance estimator is approximately unbiased for any
weights. Estimates from UATE and PATE (or equivalently SATE and CATE)
will differ depending on how sample and population sizes vary across
clusters.

4.2 Bias

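We first focus on SATE. This allows us, following Neyman (1923), to use
the randomized treatment assignment mechanism as the sole basis for
statistical inference (see also Imai, 2008). Here, the potential
outcomes are assumed fixed, but possibly unknown, quantities. We begin
by rewriting $\hat\psi(n_{1k} + n_{2k})$ using potential outcome
notation:

$$
\hat\psi(n_{1k} + n_{2k}) = \frac{1}{n} \sum_{k=1}^{m} (n_{1k} + n_{2k})
  \left\{ Z_k \left( \frac{\sum_{i=1}^{n_{1k}} Y_{i1k}(1)}{n_{1k}}
                   - \frac{\sum_{i=1}^{n_{2k}} Y_{i2k}(0)}{n_{2k}} \right)
        + (1 - Z_k) \left( \frac{\sum_{i=1}^{n_{2k}} Y_{i2k}(1)}{n_{2k}}
                   - \frac{\sum_{i=1}^{n_{1k}} Y_{i1k}(0)}{n_{1k}} \right)
  \right\}.
$$

Then, taking the expectation with respect to $Z_k$ yields

$$
E_a\{\hat\psi(n_{1k} + n_{2k})\} - \psi_S
  = \frac{1}{n} \sum_{k=1}^{m} \sum_{j=1}^{2}
    \left\{ \left( \frac{n_{1k} + n_{2k}}{2} - n_{jk} \right)
            \sum_{i=1}^{n_{jk}} \frac{Y_{ijk}(1) - Y_{ijk}(0)}{n_{jk}}
    \right\}, \tag{4}
$$

where the expectation is taken with respect to the randomization of
treatment assignment which we indicate by the subscript "a."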

Table 2
Point estimators and variances for the four causal quantities of
interest. "Identified" refers to design-based identification of
estimated causal effects without modeling assumptions

                  SATE                          CATE                          UATE                          PATE
Point estimator   $\hat\psi(n_{1k} + n_{2k})$   $\hat\psi(N_{1k} + N_{2k})$   $\hat\psi(n_{1k} + n_{2k})$   $\hat\psi(N_{1k} + N_{2k})$
Variance          $Var_a(\hat\psi)$             $Var_{au}(\hat\psi)$          $Var_{ap}(\hat\psi)$          $Var_{aup}(\hat\psi)$
Identified        no                            no                            YES                           YES

Although the bias does not generally equal zero, either of two common
conditions can eliminate it. These two conditions motivate our choice of
weights ($w_k = n_{1k} + n_{2k}$). First, when cluster sizes are equal
within each matched-pair (i.e., $n_{1k} = n_{2k}$ for all $k$), the bias
is always zero. This implies that researchers may wish to form pairs of
clusters, at least partially, based on their sample cluster size if SATE
is the estimand. Second, $\hat\psi(n_{1k} + n_{2k})$ is also unbiased if
matching is effective, so that the within-cluster SATEs are identical
for each matched-pair (i.e.,
$\sum_{i=1}^{n_{1k}} (Y_{i1k}(1) - Y_{i1k}(0))/n_{1k}
 = \sum_{i=1}^{n_{2k}} (Y_{i2k}(1) - Y_{i2k}(0))/n_{2k}$ for all $k$).
In contrast, bias may remain if cluster sizes are poorly matched and
within each pair cluster sizes are strongly associated with the
cluster-specific SATEs. However, the bounds on the bias can be found by
applying the Cauchy-Schwarz inequality to equation (4) and they can be
consistently estimated from the observed data. In sum, roughly speaking,
if cluster sizes and important confounders are matched well so that
pre-randomization matching accomplishes the purpose for which it was
designed, this estimator will be approximately unbiased.
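
The SATE bias expression can be checked numerically on a toy example. The sketch below, written under stated assumptions, enumerates all $2^m$ equally likely assignments, computes the weighted estimator with weights $n_{1k} + n_{2k}$ for each, and compares the resulting exact bias with the closed form $\frac{1}{n}\sum_k \sum_j \{((n_{1k} + n_{2k})/2 - n_{jk}) \sum_i (Y_{ijk}(1) - Y_{ijk}(0))/n_{jk}\}$. The potential outcomes and the names `clusters`, `estimate` and `expected` are invented for illustration; they are not the authors' data or code.

```python
from itertools import product

# Hypothetical potential outcomes: clusters[k][j] lists (Y(1), Y(0)) tuples
# for cluster j of pair k. Cluster sizes deliberately differ within pairs.
clusters = [
    [[(3.0, 1.0), (2.0, 1.0)],
     [(5.0, 1.0), (6.0, 2.0), (4.0, 0.0), (5.0, 1.0)]],
    [[(1.0, 1.0), (2.0, 0.0), (3.0, 2.0)],
     [(4.0, 1.0)]],
]

def mean(xs):
    return sum(xs) / len(xs)

def estimate(z, weight):
    """Weighted average of within-pair mean differences for assignment z."""
    num = den = 0.0
    for (c1, c2), zk in zip(clusters, z):
        treated, control = (c1, c2) if zk == 1 else (c2, c1)
        # Observed outcomes: Y(1) in the treated cluster, Y(0) in the control.
        diff = mean([y1 for y1, _ in treated]) - mean([y0 for _, y0 in control])
        w = weight(len(c1), len(c2))
        num += w * diff
        den += w
    return num / den

def expected(weight):
    """Exact expectation over all 2^m equally likely assignment vectors."""
    zs = list(product([0, 1], repeat=len(clusters)))
    return mean([estimate(z, weight) for z in zs])

def arithmetic(n1, n2):
    return n1 + n2          # the weights w_k = n_1k + n_2k

n = sum(len(c) for pair in clusters for c in pair)
sate = sum(y1 - y0 for pair in clusters for c in pair for y1, y0 in c) / n

bias_enumerated = expected(arithmetic) - sate
bias_formula = sum(
    ((len(p[0]) + len(p[1])) / 2 - len(c)) * mean([y1 - y0 for y1, y0 in c])
    for p in clusters for c in p
) / n

print(bias_enumerated, bias_formula)
```

For these invented numbers both quantities come out at about -0.05, and they agree up to floating-point error. Passing a different `weight` function to `expected` (for example, one returning `n1 * n2 / (n1 + n2)`) gives the exact expectation under other weighting schemes, and setting the two cluster sizes equal within every pair drives the bias to zero, as the text states.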

A similar bias expression can be derived for our CATE estimator,
$\hat\psi(N_{1k} + N_{2k})$, where the weights are now based on the
arithmetic mean of the population cluster sizes rather than their sample
counterparts. A calculation analogous to the one above yields the
following bias expression:

$$
E_{au}(\hat\psi(N_{1k} + N_{2k})) - \psi_C
  = \frac{1}{N} \sum_{k=1}^{m} \sum_{j=1}^{2}
    \left\{ \left( \frac{N_{1k} + N_{2k}}{2} - N_{jk} \right)
            E_u(Y_{ijk}(1) - Y_{ijk}(0)) \right\},
$$

where subscript "au" means that the expectation is taken with respect to
random treatment assignment and the simple random sampling of units
within each cluster. The conditions under which this bias disappears are
analogous to the ones for SATE: If matching is effective so that the
cluster-specific average causal effects, that is,
$E_u[Y_{ijk}(1) - Y_{ijk}(0)]$, are constant across clusters within each
pair, then the bias is zero. The bias also vanishes if the population
cluster sizes are identical within each pair, that is,
$N_{1k} = N_{2k}$ for all $k$. Again, the bounds on the bias can be
obtained in the manner similar to the case of SATE above.

Finally, the bias for UATE and PATE can be obtained by taking the expectation
of the bias for SATE and CATE, respectively, with the expectation defined with
respect to random sampling of cluster pairs. If the within-cluster sample
(population) average treatment effects are uncorrelated with cluster sizes
within each matched pair, then the bias for the estimation of UATE (PATE) is
zero, regardless of whether one can match exactly on cluster sizes. In
general, however, cluster sizes may be correlated with the size of average
treatment effects. In such cases, the matching strategies that reduce the bias
for the estimation of SATE (CATE) also work for the estimation of UATE (PATE).
That is, pairs of clusters should be constructed such that, within each pair,
cluster sizes and important pre-treatment covariates are similar. (We also
derived an unbiased estimator and its variance, but we do not present them
here because they are not invariant to a constant shift of the outcome
variable when cluster sizes vary within each pair.)
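
As a rough numerical check of the bias discussion above, the following Python sketch enumerates every treatment assignment for a small set of made-up pairs and compares the exact design-based bias of the arithmetic mean weighted estimator with a closed form. The closed form used here is the sample (SATE) analogue of the CATE expression, with $n_{jk}$ in place of $N_{jk}$ and no expectation over unit sampling. The data and the function names (`pair_estimate`, `exact_bias`, `formula_bias`) are illustrative assumptions, not part of the SPS evaluation.

```python
from itertools import product

# Each cluster is a list of (Y(0), Y(1)) potential outcomes; a pair is (cluster 1, cluster 2).
# All numbers are hypothetical and only serve to exercise the formulas.

def pair_estimate(pairs, z):
    """Estimator with weights n1k + n2k for one assignment z (z[k] = 1: cluster 1 treated)."""
    n = sum(len(c1) + len(c2) for c1, c2 in pairs)
    total = 0.0
    for (c1, c2), zk in zip(pairs, z):
        treated, control = (c1, c2) if zk == 1 else (c2, c1)
        mean_t = sum(y1 for _, y1 in treated) / len(treated)   # observed Y(1) mean
        mean_c = sum(y0 for y0, _ in control) / len(control)   # observed Y(0) mean
        total += (len(c1) + len(c2)) * (mean_t - mean_c)
    return total / n

def exact_bias(pairs):
    """Average the estimator over all 2^m equally likely assignments, then subtract SATE."""
    n = sum(len(c1) + len(c2) for c1, c2 in pairs)
    m = len(pairs)
    expected = sum(pair_estimate(pairs, z) for z in product((0, 1), repeat=m)) / 2 ** m
    sate = sum(y1 - y0 for c1, c2 in pairs for c in (c1, c2) for y0, y1 in c) / n
    return expected - sate

def formula_bias(pairs):
    """Closed form: (1/n) sum_k sum_j ((n1k + n2k)/2 - njk) * (within-cluster effect)."""
    n = sum(len(c1) + len(c2) for c1, c2 in pairs)
    total = 0.0
    for c1, c2 in pairs:
        half = (len(c1) + len(c2)) / 2
        for c in (c1, c2):
            effect = sum(y1 - y0 for y0, y1 in c) / len(c)
            total += (half - len(c)) * effect
    return total / n

unequal = [
    ([(1.0, 3.0), (2.0, 2.5)], [(0.0, 1.0), (1.0, 1.5), (2.0, 4.0), (3.0, 3.0)]),
    ([(5.0, 5.5), (4.0, 6.0), (3.0, 3.0)], [(2.0, 2.0)]),
]
equal_sizes = [([(1.0, 2.0), (0.0, 3.0)], [(2.0, 2.0), (1.0, 5.0)])]

print(exact_bias(unequal), formula_bias(unequal))
print(exact_bias(equal_sizes))
```

With equal cluster sizes inside every pair the bias is zero even though the two clusters have different effects, which matches the conditions stated in the text.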

4.3 Variance 

In a critical comment about Klar and Donner (1997), Thompson (1998) shows how
to obtain valid variance estimates under the linear mixed effects model and
the "common effect assumption." In their reply, Klar and Donner (1998)
criticize the common effect assumption and, as a result, maintain their claim
of analytical difficulties with MPCRs. We show here how to obtain valid
variance estimates without the common treatment effect assumption or other
modeling assumptions.

Rather than focusing on each of our proposed estimators,
$\hat\psi(n_{1k}+n_{2k})$ and $\hat\psi(N_{1k}+N_{2k})$, separately, we
consider the variance of the general estimator, $\hat\psi(w_k)$ in equation
(3), so that the analytical results we develop apply to any choice of weights,
including the harmonic mean weights. For notational simplicity, we use
normalized weights, that is, $\tilde w_k \equiv n w_k/\sum_{k=1}^m w_k$ (so
that the weights sum to $n$, as in our estimator of SATE and UATE), and
consider the variance of $\hat\psi(\tilde w_k)$. First, we use potential
outcomes notation and write $\hat\psi(\tilde w_k) = \sum_{k=1}^m \tilde
w_k\{Z_k D_k(1) + (1-Z_k)D_k(0)\}/n$. Then, our variance estimator is

\[
\hat\sigma(\tilde w_k) \equiv \frac{m}{(m-1)n^2}\sum_{k=1}^{m}
\left[\tilde w_k\left\{
  Z_k\left(\frac{\sum_{i=1}^{n_{1k}}Y_{i1k}}{n_{1k}}
         - \frac{\sum_{i=1}^{n_{2k}}Y_{i2k}}{n_{2k}}\right)
  + (1-Z_k)\left(\frac{\sum_{i=1}^{n_{2k}}Y_{i2k}}{n_{2k}}
         - \frac{\sum_{i=1}^{n_{1k}}Y_{i1k}}{n_{1k}}\right)
\right\} - \frac{n\hat\psi(\tilde w_k)}{m}\right]^2.
\tag{6}
\]
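
A minimal Python sketch of the point estimator and of the variance estimator in equation (6) may help fix ideas. It takes the observed within-pair differences in means (treated minus control) and raw weights, normalizes the weights to sum to $n$, and returns both quantities. The function name `matched_pair_estimate` and the toy outcomes are assumptions made for illustration only.

```python
from statistics import mean

def matched_pair_estimate(diffs, weights, n):
    """Return (psi_hat, sigma_hat) from observed pair differences D_k and raw weights w_k.

    diffs[k]   : treated-cluster mean minus control-cluster mean in pair k
    weights[k] : raw weight w_k (for example n1k + n2k)
    n          : total number of sampled units
    """
    m = len(diffs)
    if m < 2:
        raise ValueError("at least two pairs are needed for the variance estimator")
    scale = n / sum(weights)
    w = [scale * wk for wk in weights]                  # normalized weights, sum to n
    psi = sum(wk * dk for wk, dk in zip(w, diffs)) / n
    spread = sum((wk * dk - n * psi / m) ** 2 for wk, dk in zip(w, diffs))
    sigma = m / ((m - 1) * n ** 2) * spread             # equation (6)
    return psi, sigma

# Hypothetical data: (treated outcomes, control outcomes) for three pairs.
pairs = [([3.0, 5.0], [1.0, 2.0, 3.0]),
         ([2.0, 2.0, 2.0], [1.0, 1.0, 1.0]),
         ([4.0], [1.0, 2.0, 3.0])]
diffs = [mean(t) - mean(c) for t, c in pairs]
weights = [len(t) + len(c) for t, c in pairs]           # arithmetic mean weights n1k + n2k
psi_hat, sigma_hat = matched_pair_estimate(diffs, weights, n=sum(weights))
print(psi_hat, sigma_hat)
```

Because the weights are passed in as an argument, the same function covers harmonic mean or population-size weights, provided `n` is still the total number of sampled units.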



SATE. We first consider the variance of $\hat\psi(\tilde w_k)$ for SATE.
Taking expectations with respect to $Z_k$, the true variance of
$\hat\psi(\tilde w_k)$ is given by

\[
\mathrm{Var}_a(\hat\psi(\tilde w_k))
  = \frac{1}{4n^2}\sum_{k=1}^{m}\tilde w_k^2\,(D_k(1)-D_k(0))^2.
\tag{7}
\]

This variance is not identified since we do not jointly observe $D_k(1)$ and
$D_k(0)$ for each $k$. Thus, we identify an upper bound of this variance,
making no additional assumptions, and estimate it from the observed data.

The next proposition establishes that the true variance,
$\mathrm{Var}_a(\hat\psi(\tilde w_k))$, is not identifiable, and shows that
our proposed variance estimator, $\hat\sigma(\tilde w_k)$, is conservative.

Proposition 1 (SATE variance identification). Suppose that SATE is the
estimand. Then, the true variance of $\hat\psi(\tilde w_k)$ is not
identifiable. The bias of $\hat\sigma(\tilde w_k)$ is given by

\[
E_a(\hat\sigma(\tilde w_k)) - \mathrm{Var}_a(\hat\psi(\tilde w_k))
  = \frac{m}{4n^2}\,\mathrm{var}\{\tilde w_k(D_k(1)+D_k(0))\},
\]

where $\mathrm{var}(\cdot)$ represents the sample variance with denominator
$m-1$.

See Appendix A.1 for a proof. This proposition implies that on average
$\hat\sigma(\tilde w_k)$ overestimates the true variance
$\mathrm{Var}_a(\hat\psi(\tilde w_k))$ unless the sample variance of weighted
within-cluster SATEs across pairs is zero. For example, if SATE is constant
across pairs, and the cluster sizes are equal, $\hat\sigma(\tilde w_k)$
estimates the true variance without bias. However, such a scenario is highly
unlikely under MPCR, and thus $\hat\sigma(\tilde w_k)$ should be seen as a
conservative estimator of the variance. It is also possible to obtain a less
conservative variance estimate than $\hat\sigma(\tilde w_k)$. For example,
researchers may use a consistent estimator of
$\{(\sum_{k=1}^m \tilde w_k^2 D_k(1)^2)^{1/2} + (\sum_{k=1}^m \tilde w_k^2
D_k(0)^2)^{1/2}\}^2/(4n^2)$, which is obtained by applying the Cauchy-Schwarz
inequality to equation (7). Another approach to obtain a tighter bound would
be to apply the covariance inequality to the bias expression given in
Proposition 1.
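
Proposition 1 can be checked numerically on a toy example. The sketch below assumes hypothetical values of $D_k(1)$ and $D_k(0)$ (which are never jointly observed in practice), enumerates all $2^m$ equally likely assignments, and compares the exact randomization variance with equation (7) and the exact bias of the variance estimator with the expression in the proposition. The function name `check_proposition_1` and all inputs are illustrative assumptions.

```python
from itertools import product
from statistics import variance

def check_proposition_1(d1, d0, w, n):
    """Exact enumeration over assignments for given D_k(1), D_k(0) and normalized weights w."""
    m = len(w)
    estimates, sigmas = [], []
    for z in product((0, 1), repeat=m):
        # Observed weighted difference in each pair under assignment z.
        x = [wk * (a if zk else b) for wk, a, b, zk in zip(w, d1, d0, z)]
        psi = sum(x) / n
        estimates.append(psi)
        sigmas.append(m / ((m - 1) * n ** 2) * sum((xk - n * psi / m) ** 2 for xk in x))
    mean_psi = sum(estimates) / len(estimates)
    true_var = sum((e - mean_psi) ** 2 for e in estimates) / len(estimates)
    eq7 = sum(wk ** 2 * (a - b) ** 2 for wk, a, b in zip(w, d1, d0)) / (4 * n ** 2)
    bias = sum(sigmas) / len(sigmas) - true_var
    # statistics.variance uses the m - 1 denominator, as in the proposition.
    predicted = m / (4 * n ** 2) * variance([wk * (a + b) for wk, a, b in zip(w, d1, d0)])
    return true_var, eq7, bias, predicted

# Hypothetical potential within-pair differences and weights summing to n = 20.
true_var, eq7, bias, predicted = check_proposition_1(
    d1=[2.0, 1.0, 3.5, 0.5], d0=[1.0, 1.5, 2.0, 2.5], w=[4.0, 6.0, 5.0, 5.0], n=20.0)
print(true_var, eq7, bias, predicted)
```

The bias is nonnegative by construction, so the estimator errs on the conservative side whenever the weighted sums $\tilde w_k(D_k(1)+D_k(0))$ vary across pairs.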

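CATE. Next, we study the variance of $\hat\psi(\tilde w_k)$ for CATE, which we
write as

\[
\begin{aligned}
\mathrm{Var}_{au}(\hat\psi(\tilde w_k))
  &= E_u\{\mathrm{Var}_a(\hat\psi(\tilde w_k))\}
   + \mathrm{Var}_u\{E_a(\hat\psi(\tilde w_k))\} \\
  &= \frac{1}{4n^2}\sum_{k=1}^{m}\tilde w_k^2
     \left[E_u\{(D_k(1)-D_k(0))^2\} + \mathrm{Var}_u\{D_k(1)+D_k(0)\}\right],
\end{aligned}
\tag{8}
\]

where the second equality holds because sampling of units is independent
within clusters. Similar to the SATE variance, this is not identified since we
do not jointly observe $D_k(1)$ and $D_k(0)$ for each $k$. The next
proposition shows that $\hat\sigma(\tilde w_k)$ is again conservative.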

Proposition 2 (CATE variance identification). Suppose that CATE is the
estimand. The true variance of $\hat\psi(\tilde w_k)$,
$\mathrm{Var}_{au}(\hat\psi(\tilde w_k))$, is not identifiable. The bias of
$\hat\sigma(\tilde w_k)$ is given by

\[
E_{au}(\hat\sigma(\tilde w_k)) - \mathrm{Var}_{au}(\hat\psi(\tilde w_k))
  = \frac{m}{4n^2}\,\mathrm{var}\{\tilde w_k E_u(D_k(1)+D_k(0))\}.
\]

See Appendix A.2 for a proof. The proposition implies that our proposed
variance estimator, $\hat\sigma(\tilde w_k)$, is on average an upper bound of
the true variance. As in the case of the SATE, this upper bound can be
improved. For example, rewrite the variance in equation (8) as

\[
\mathrm{Var}_{au}(\hat\psi(\tilde w_k))
  = \frac{1}{2n^2}\sum_{k=1}^{m}\tilde w_k^2
    \left[\mathrm{Var}_u(D_k(1)) + \mathrm{Var}_u(D_k(0))
          + \frac{1}{2}\{E_u(D_k(1)-D_k(0))\}^2\right].
\tag{9}
\]

Then, apply the Cauchy-Schwarz inequality to the third term in the bracket of
equation (9). Alternatively, applying the covariance inequality to the bias
expression in Proposition 2 yields a tighter bound.

UATE and PATE. Unlike in the case of the SATE and the CATE, the variance of
$\hat\psi$ is identified and can be estimated approximately without bias when
UATE or PATE is the estimand. We establish this result as the following
proposition:

Proposition 3. Conditional on $\bar w = \sum_{k=1}^m \tilde w_k/m$, the
variances of $\hat\psi(\tilde w_k)$ for estimating the UATE and PATE are given
by:

\[
\mathrm{Var}_{ap}(\hat\psi(\tilde w_k))
  = \frac{m}{n^2}\,\mathrm{Var}_p(\tilde w_k D_k),
\]
\[
\mathrm{Var}_{aup}(\hat\psi(\tilde w_k))
  = \frac{m}{n^2}\left[E_p\{\tilde w_k^2\,\mathrm{Var}_u(D_k)\}
    + \mathrm{Var}_p\{\tilde w_k E_u(D_k)\}\right],
\]

respectively, where $D_k = Z_k D_k(1) + (1-Z_k)D_k(0)$ and "p" represents the
expectation with respect to simple random sampling of matched pairs of
clusters. Conditional on $\bar w$, both variances can be estimated by
$\hat\sigma(\tilde w_k)$ without bias under their corresponding sampling
schemes.

See Appendix A.3 for a proof. The proposition shows that when estimating PATE,
the variance of $\hat\psi(\tilde w_k)$ is proportional to the sum of two
elements: the mean of within-cluster variances and the variance of
within-cluster means. If all units are included in each cluster, then the
first term will be zero because the within-cluster means are observed without
sampling uncertainty, that is, $\mathrm{Var}_u(D_k) = 0$ for all $k$. In
either case, however, our proposed variance estimator $\hat\sigma(\tilde w_k)$
is unbiased, conditional on the mean weight, $\bar w$.

Inference. Given our proposed estimators and variances, we make statistical
inferences by assuming that $\hat\psi(\tilde w_k)$ is approximately unbiased.
We consider three situations:

1. Many pairs. When the number of pairs is large (regardless of the number of
   units within each cluster), no additional assumption is necessary due to
   the central limit theorem. For PATE and UATE, the level $\alpha$ confidence
   intervals are given by
   $[\hat\psi(\tilde w_k) - z_{\alpha/2}\sqrt{\hat\sigma(\tilde w_k)},\;
   \hat\psi(\tilde w_k) + z_{\alpha/2}\sqrt{\hat\sigma(\tilde w_k)}]$,
   where $z_{\alpha/2}$ represents the critical value of the two-sided level
   $\alpha$ normal test. For the SATE and CATE, the confidence level of this
   interval will be greater than or equal to $\alpha$.

2. Few pairs, many units. For CATE (and PATE), the central limit theorem
   implies that $D_k$ follows the normal distribution. Since the weights are
   assumed to be fixed for CATE, $\tilde w_k D_k$ is also normally
   distributed. For the other three quantities, we assume $\tilde w_k D_k$ is
   normally distributed. In either case, the level $\alpha$ confidence
   intervals are given by
   $[\hat\psi(\tilde w_k) - t_{m-1,\alpha/2}\sqrt{\hat\sigma(\tilde w_k)},\;
   \hat\psi(\tilde w_k) + t_{m-1,\alpha/2}\sqrt{\hat\sigma(\tilde w_k)}]$,
   where $t_{m-1,\alpha/2}$ represents the critical value of the one-sample,
   two-sided level $\alpha$ t-test with $(m-1)$ degrees of freedom. For the
   SATE and CATE, the confidence level of this interval will be greater than
   or equal to $\alpha$.

3. Few pairs, few units. When little information is available, a
   distributional assumption is required for the inferences about all four
   quantities. We may assume $\tilde w_k D_k$ follows the normal distribution
   as above and construct the confidence intervals and conduct hypothesis
   tests based on the t-distribution.
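
The intervals in the three cases above differ only in the critical value. A small Python sketch, using only the standard library, is given here under the assumption that the caller supplies the coverage probability directly (`conf_level`, a hypothetical parameter chosen to avoid any ambiguity about what "level" means) and, for the t-based interval, a critical value looked up from a t table.

```python
from math import sqrt
from statistics import NormalDist

def normal_interval(psi_hat, sigma_hat, conf_level=0.90):
    """Many-pairs case: interval based on the normal critical value.

    sigma_hat is the estimated variance of psi_hat (not a standard error).
    """
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf_level) / 2.0)
    half_width = z * sqrt(sigma_hat)
    return psi_hat - half_width, psi_hat + half_width

def t_interval(psi_hat, sigma_hat, t_crit):
    """Few-pairs cases: same form, with a t critical value on m - 1 degrees of freedom.

    t_crit must be supplied by the caller, for example from a printed t table.
    """
    half_width = t_crit * sqrt(sigma_hat)
    return psi_hat - half_width, psi_hat + half_width

# Hypothetical estimate and estimated variance from m = 10 pairs.
low_z, high_z = normal_interval(1.6, 0.04, conf_level=0.95)
low_t, high_t = t_interval(1.6, 0.04, t_crit=2.262)   # two-sided 95% value for 9 d.f.
print((low_z, high_z), (low_t, high_t))
```

With few pairs the t-based interval is noticeably wider than the normal one, which is the price of not relying on the central limit theorem.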

Finally, although it was once thought that the need for, and inability to
estimate, the intracluster correlation coefficient (ICC) was a major
disadvantage of MPCR designs (Campbell, Mollison and Grimshaw, 2001; Klar and
Donner, 1997; Donner, 1998), estimates of the ICC are in fact not needed for
our estimators or their variances. Below, we show that efficiency analysis,
power comparisons and sample size calculations can also be conducted without
estimating the ICC.

4.4 Performance in Practice 

We now study how our estimator and the har- 
monic mean estimator work in practice. The results 
here also motivate a combined approach to estima- 
tion we offer in Section 4.5. 

Confidence interval coverage. To construct realis- 
tic simulations, we begin with the observed cluster- 
specific mean for two out-of-pocket health expen- 
ditures from the SPS evaluation data (measured in 
pesos) and use this to set the potential outcomes' 
true population for the simulation. Finally, we gen- 
erate the outcome variables via independent normal 
draws for units within clusters using a set of het- 
erogeneous variances. Thus, the existing harmonic 
mean estimator's mean and variance constancy as- 
sumptions are violated, as is common in real data, 
although its normality and independence assumptions are maintained.
(Replication data are available in Imai, King and Nall, 2009.)

We study the properties of the proposed and existing variance estimators with
PATE or UATE as the estimand. (As shown in Proposition 2, the CATE variance is
not identified and the expectation of our variance estimator equals an upper
bound.) We generate a population of clusters by bootstrapping the observed
pairs of SPS clusters along with their observed means and a set of
heterogeneous variances. We then compute coverage probabilities under both
estimators, where the arithmetic and harmonic mean weights are used for the
proposed and existing estimators, respectively. We draw from the discrete
empirical distribution, which is far from a Gaussian distribution, yielding a
hard case for both estimators. The left panels of Figure 1 summarize the
results. As expected due to the central limit theorem, both sets of our 90%
confidence intervals (solid disks) approach their corresponding nominal
coverage probabilities as the number of pairs increases. In contrast, the
confidence intervals based on the harmonic mean variance estimator (open
diamonds) are biased (too wide in the top graph and too narrow in the bottom),
and the magnitude of bias does not decrease even as the number of pairs grows.

Standard error comparisons. We begin by com- 
puting the standard error (the square root of the 
estimated variance) based on the general variance 
formula proposed in Donner (1987), Donner and Don- 
ald (1987) and Donner and Klar (1993), as well as 
the one based on our approximately unbiased al- 
ternative. For comparability, we use the arithmetic 
mean weights for both standard error calculations. 
We make these computations for a large number 
of outcome variables from the SPS evaluation sur- 
vey conducted 10 months after randomization. The 
outcome variables include some which were binary 
(e.g., did the respondent suffer catastrophic medi- 
cal expenditures? Does our blood test indicate that 
the respondent has high cholesterol? Has the respon- 
dent been diagnosed with asthma?) and others de- 
nominated in Mexican pesos (e.g., out-of-pocket ex- 
penditures for health care, for drugs, etc.). We then 
divided this standard error by our alternative for 
each variable. The top right graph in Figure 1 gives 
a smoothed histogram of these ratios (plotted on 
the log scale but labeled in original units, with 1 
the point of equality). In these real data, the bi- 
ased standard errors range from about two times 
too small to two times too large. Note that the cen- 
tral tendency of this histogram has no particular 
meaning, as it is constructed from whatever ques- 
tions happened to be asked on the survey. The key 
point is that in real data the deviation from the 
approximately unbiased estimator for any one such 
standard error can be large in either direction. 

Bias-variance tradeoff. Using data from an expen- 
diture outcome in the SPS sample, we simulate an 
instance in which the variance of the existing es- 
timator outperforms our estimator. To distinguish 



between the harmonic and arithmetic mean, we be- 
gin by setting all within-pair cluster sizes equal to 
the size of the treatment cluster in the SPS eval- 
uation. Then, keeping the total pair size constant, 
we increase the difference in within pair cluster size 
such that the added difference in cluster sizes is pro- 
portional to the within-pair treatment effect. This 
leaves the average treatment effect constant while 
demonstrating differences in the two weighting schemes. 
The bottom right graph in Figure 1 presents the ab- 
solute difference between the two estimators in mean 
square error, squared bias and variance, with the ob- 
served SPS value marked with a vertical line. 

The overall picture from these results indicates 
that the arithmetic estimator would be preferred be- 
cause it has lower mean square error than the har- 
monic mean estimator. However, at the expense of 
introducing bias when treatment effect is both vari- 
able across pairs and correlated with the cluster size, 
the harmonic mean estimator can have substantially 
lower variance. These results suggest the possibility 
of an improved estimator based on the combination 
of both approaches, a subject to which we now turn. 

4.5 An Encompassing, Model-Based Approach 

The standard harmonic mean estimator is unbi- 
ased when applied to data where the between-cluster 
homogeneity assumption holds. In this situation, the 
harmonic mean weights also have the attractive prop- 
erty of downweighting observations with worse matches 
and larger variances, thereby reducing variance. If 
the homogeneity assumption is violated, however, 
then one cannot afford to downweight pairs, no mat- 
ter how badly matched or imprecisely measured, be- 
cause doing so could result in arbitrarily large biases. 
In contrast, the arithmetic mean estimator avoids 
bias by making no assumptions about the nature 
of how treatment effects vary over the pairs. How- 
ever, a consequence of it imposing no structure on 
treatment effect heterogeneity is that mismatched 
pairs are not downweighted and so some inefficiency 
may result if in fact the treatment effects are similar 
across pairs. 

Fig. 1. Inference Accuracy. Simulations in the left panels demonstrate how our
estimator's coverage is approximately correct and increasingly so for larger
sample sizes, while the existing estimator can yield confidence intervals that
are either too large (top left) or too small (bottom left). The top right
panel uses real data to give the ratios of the harmonic mean standard error to
our approximately unbiased alternative on the horizontal axis (on the log
scale, but labeled as ratios). The bottom right figure gives squared bias,
MSE, and variance comparisons as a function of the average cluster size ratio;
a vertical line marks the observed heterogeneity in the SPS data.

We now combine the insights of these two approaches and propose a single
encompassing model that provides some of the advantages of each, at the cost
of somewhat more stringent assumptions than with our design-based approach.
Consider data with $m^*$ groups of clusters, where the homogeneity assumption
holds within each group. Assume that the clusters within any one pair are
never split between groups. Let $g(k) = l$ denote the group to which pair $k$
belongs, $l = 1, 2, \ldots, m^*$ with $m^* \le m$. Then, make the modeling
assumption that
$Y_{ijk}(t) \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu_t^{(g(k))},
\sigma_t^{(g(k))})$ for $t = 0, 1$, where $\mu_t^{(l)}$ and $\sigma_t^{(l)}$
are not necessarily equal to $\mu_t^{(l')}$ and $\sigma_t^{(l')}$ for
$l \ne l'$. Under this model, CATE equals

\[
\psi_C = \frac{1}{N}\sum_{k=1}^{m}(N_{1k}+N_{2k})
         \left(\mu_1^{(g(k))} - \mu_0^{(g(k))}\right).
\tag{10}
\]

When the group membership is known ex ante, an unbiased and efficient
estimator of CATE is given by replacing
$\mu^{(l)} \equiv \mu_1^{(l)} - \mu_0^{(l)}$ with its harmonic mean estimate

\[
\hat\mu^{(l)} = \frac{\sum_{k=1}^{m} 1\{g(k) = l\}\, w_k D_k}
                     {\sum_{k'=1}^{m} 1\{g(k') = l\}\, w_{k'}},
\]

where $w_k = n_{1k}n_{2k}/(n_{1k}+n_{2k})$ and $1\{\cdot\}$ represents the
indicator function. Thus, this mixture model estimator is an arithmetic mean
of within (homogeneous) group harmonic mean estimators. A special case is the
harmonic mean estimator in the literature, where the homogeneity assumption is
made across all clusters, that is, $m^* = 1$. When every pair belongs to a
different group, that is, $m^* = m$, this estimator approximates our proposed
design-based estimator. In many applications, the group membership as well as
the number of groups may be unknown. In this case, CATE may be estimated via
standard methods for fitting finite mixture models (e.g., McLachlan and Peel,
2000).
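
For the case where group membership is known ex ante, the estimator described above can be sketched in a few lines of Python: compute a harmonic mean weighted effect within each group, then average those group effects over pairs using population pair sizes. The function name `grouped_cate_estimate` and the numbers are hypothetical; the unknown-membership case, which requires fitting a finite mixture model, is not attempted here.

```python
from collections import defaultdict

def grouped_cate_estimate(diffs, sample_sizes, pop_sizes, groups):
    """Arithmetic mean (over pairs) of within-group harmonic mean estimates.

    diffs[k]        : observed within-pair difference in means D_k
    sample_sizes[k] : (n1k, n2k), sampled units in the two clusters of pair k
    pop_sizes[k]    : (N1k, N2k), population sizes of the two clusters
    groups[k]       : known group label g(k)
    """
    num, den = defaultdict(float), defaultdict(float)
    for d, (n1, n2), g in zip(diffs, sample_sizes, groups):
        w = n1 * n2 / (n1 + n2)            # harmonic mean weight for pair k
        num[g] += w * d
        den[g] += w
    group_effect = {g: num[g] / den[g] for g in num}
    total_pop = sum(N1 + N2 for N1, N2 in pop_sizes)
    weighted = sum((N1 + N2) * group_effect[g] for (N1, N2), g in zip(pop_sizes, groups))
    return weighted / total_pop

# Hypothetical inputs for four pairs.
diffs = [2.0, 1.0, 3.0, 0.5]
sample_sizes = [(10, 20), (15, 15), (8, 12), (30, 10)]
pop_sizes = [(100, 200), (150, 150), (80, 120), (300, 100)]

one_group = grouped_cate_estimate(diffs, sample_sizes, pop_sizes, [0, 0, 0, 0])
all_separate = grouped_cate_estimate(diffs, sample_sizes, pop_sizes, [0, 1, 2, 3])
two_groups = grouped_cate_estimate(diffs, sample_sizes, pop_sizes, [0, 0, 1, 1])
print(one_group, two_groups, all_separate)
```

The two extreme groupings recover the harmonic mean estimator ($m^* = 1$) and the population-size weighted estimator ($m^* = m$); intermediate groupings trade off between them.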

4.6 Cluster-Level Quantities of Interest 

The eight quantities of interest defined in Section 3.3 (SATE, CATE, PATE and
UATE, both with and without interference) are all defined as aggregations of
unit-level causal effects. For some purposes, however, analogous quantities of
interest can be defined at the cluster level. For example, quantities of
interest in the SPS evaluation include the health clinic-level variables. Some
of these effects, such as the supply of drugs and doctors, are defined and
measured at the health clinic, and so are effectively unit-level variables
amenable to cluster-level analyses.

However, for other variables, individual-level survey responses are required
to measure the aggregate variables. Examples include the success health
clinics in our experiment have in protecting privacy, reducing waiting times,
etc. If these latter variables are used to judge the causal effect of SPS on
the clinics, we have a CR experiment, but a quantity of interest at the
cluster level. In this situation, our estimator is a special case of equation
(3), with a constant weight, $\hat\psi(1)$. Similarly, the variance of this
estimator is a special case of our general formulation in equation (6),
$\hat\sigma(1)$. This estimator for aggregate quantities is unbiased and
invariant for all quantities of interest. In the case of unit-level variables
amenable to cluster-level analysis (such as those collected via survey), there
will likely be sampling error, which may result in a larger variance.

5. COMPARING MATCHED-PAIR AND 
OTHER DESIGNS 

We now study the relative efficiency and power of 
the MPCR and unmatched cluster randomization 
(UMCR) designs, and give sample size calculations 
for MPCR. We also briefly compare MPCR with the 
stratified design and discuss the consequences of loss 
of clusters under each. 

5.1 Unmatched Cluster Randomized Design 

The UMCR design is defined as follows. Consider a random sample of $2m$
clusters from a population. We observe a total of $n_j$ units within the $j$th
cluster in the sample, and use $n$ to denote the total number of units in the
sample, $n = \sum_{j=1}^{2m} n_j$. Under this design, $m$ randomly selected
clusters are assigned to the treatment group with equal probability while the
remaining $m$ clusters are assigned to the control group.

We construct an estimator for the UMCR design, analogous to the one we
proposed for MPCR, as

\[
\begin{aligned}
\hat\tau(\tilde w_j)
  &\equiv \frac{2}{n}\sum_{j=1}^{2m}\sum_{i=1}^{n_j}\frac{\tilde w_j}{n_j}
          \{Z_j Y_{ij} - (1-Z_j)Y_{ij}\} \\
  &= \frac{2}{n}\sum_{j=1}^{2m}\sum_{i=1}^{n_j}\frac{\tilde w_j}{n_j}
          \{Z_j Y_{ij}(1) - (1-Z_j)Y_{ij}(0)\},
\end{aligned}
\tag{11}
\]

where $Z_j$ is the randomized binary treatment variable, $Y_{ij}(t)$ is the
potential outcome for the $i$th unit in the $j$th cluster under the treatment
value $t$ for $t = 0, 1$, and $\tilde w_j$ is the known normalized weight with
$\sum_{j=1}^{2m}\tilde w_j = n$. For SATE and UATE, we use $\tilde w_j = n_j$.
For CATE and PATE, we use $\tilde w_j \propto N_j$, where $N_j$ is the
population size of the $j$th cluster. Analysis similar to the one in Section
4.2 shows that this estimator is unbiased for all four quantities in UMCR
experiments.

The commonly used estimator in the literature for this design takes a form
slightly different from equation (11):
$\hat\kappa \equiv \sum_{j=1}^{2m} Z_j\sum_{i=1}^{n_j}Y_{ij}/\sum_{j=1}^{2m}
Z_j n_j - \sum_{j=1}^{2m}(1-Z_j)\sum_{i=1}^{n_j}Y_{ij}/(n -
\sum_{j=1}^{2m}Z_j n_j)$. This estimator is applicable to SATE and UATE but
not CATE and PATE because it ignores cluster population weights. The estimator
is also biased for SATE and UATE, and the magnitude of bias can be derived
using the Taylor series. Without modeling assumptions, the exact variance
calculation is difficult within the design-based framework because the
denominator as well as the numerator is a function of the randomized treatment
variable. In addition, the usual approximate variance calculations for such a
ratio estimator yield either the same variance as $\hat\tau(n_j)$ or a
variance estimator that is not invariant to a constant shift. Thus, for the
sake of simplicity, we focus on $\hat\tau(\tilde w_j)$ in this section,
although $\hat\kappa$ and its approximate variance estimator may perform
reasonably well in practice.

For the rest of this section, we assume that the 
estimand is UATE. However, the same calculations 
apply when the estimand is PATE since the vari- 
ance estimator is the same for both. For SATE and 
CATE, we can interpret these results as conservative 
estimates of efficiency, power and sample sizes. 

5.2 Efficiency 

When the estimand is UATE, the variance of $\hat\tau(\tilde w_j)$ is
approximately (conditional on $n = \sum_{j=1}^{2m} n_j$) equal to
$\mathrm{Var}_{ac}(\hat\tau(\tilde w_j)) = \frac{4m}{n^2}\{\mathrm{Var}_c(
\tilde w_j\bar Y_j(1)) + \mathrm{Var}_c(\tilde w_j\bar Y_j(0))\}$, where
$\bar Y_j(t) = \sum_{i=1}^{n_j}Y_{ij}(t)/n_j$ for $t = 0, 1$, and the
subscript "c" represents the simple random sampling of clusters. To facilitate
comparison, assume that under MPCR one is able to match on cluster sizes so
that $n_{1k} = n_{2k}$ for all $k$. Proposition 3 implies that under the same
condition the variance of $\hat\psi(\tilde w_k)$ can be approximated by
$\mathrm{Var}_{ap}(\hat\psi(\tilde w_k)) = m\,\mathrm{Var}_p\{\tilde w_k(\bar
Y_{jk}(1) - \bar Y_{j'k}(0))\}/n^2$, where $\bar Y_{jk}(t) =
\sum_{i=1}^{n_{jk}}Y_{ijk}(t)/n_{jk}$ and $j \ne j'$. Since the assumption of
$n_{1k} = n_{2k}$ means $\tilde w_k = 2\tilde w_j$, we have
$\mathrm{Var}_p(\tilde w_k\bar Y_{jk}(t)) = 4\,\mathrm{Var}_c(\tilde w_j\bar
Y_j(t))$ for $t = 0, 1$. Thus, the relative efficiency of MPCR over UMCR is

\[
\frac{\mathrm{Var}_{ac}(\hat\tau(\tilde w_j))}
     {\mathrm{Var}_{ap}(\hat\psi(\tilde w_k))}
  = \left\{1 - \frac{2\,\mathrm{Cov}_p(\tilde w_k\bar Y_{jk}(1),\,
                                      \tilde w_k\bar Y_{j'k}(0))}
                    {\sum_{t=0}^{1}\mathrm{Var}_p(\tilde w_k\bar Y_{jk}(t))}
    \right\}^{-1}.
\]

This implies that the relative efficiency of MPCR depends on the correlation
of the observed within-pair cluster mean outcomes weighted by cluster sizes.
If matching induces a positive correlation, as is its purpose and will
normally occur in practice, then MPCR is more efficient. (In the worst case
scenario where matching is implemented in a manner opposite to the way it was
designed, and thus induces a negative correlation, MPCR can be less
efficient.) Under MPCR, we can estimate $\mathrm{Cov}_p(\tilde w_k\bar
Y_{jk}(1), \tilde w_k\bar Y_{j'k}(0))$ without bias using the sample
covariance between $\tilde w_k\bar Y_{jk}(1)$ and $\tilde w_k\bar Y_{j'k}(0)$,
which are jointly observed for each $k$. Thus, under MPCR, the variance one
would obtain under UMCR can also be estimated without bias (see also Imai,
2008). This is another advantage of MPCR since the converse is not true. (If
cluster sizes are equal, one can also estimate the ICC nonparametrically and
separately for the treated and control groups; there is no reason to assume
the ICC is the same for two potential outcomes as done in the literature. Note
that the ICC is not required for efficiency, power or sample size
calculations.)

Empirical evidence. Although MPCR designs have other advantages in public
policy evaluations (King et al., 2007), their advantage in statistical
efficiency can be considerable. We estimate the efficiency of MPCR as used in
the SPS evaluation over the efficiency that our experiment would have
achieved, if we had used complete randomization without matching. Figure 2
plots the relative efficiency of our estimator for MPCR over UMCR for UATE and
for PATE. We do this for our 14 outcome variables denominated in pesos. For
UATE, the estimator based on the MPCR is between 1.13 and 2.92 times more
efficient, which means that our standard errors would have been as much as
$\sqrt{2.92} \approx 1.7$ times larger if we had neglected to pair clusters
first. The result is even more dramatic for estimating PATE, for which the
MPCR design for different variables is between 1.8 and 38.3 times more
efficient. In this situation, our standard errors would have been as much as
six times larger if we had neglected to match first.

5.3 Power

We now use the variance results in Section 4.3 to 
calculate statistical power, that is, the probability of 
rejecting the null if it is indeed false, for UATE and 
PATE, which also represent the minimum power for 
SATE and CATE, respectively. 

5.3.1 Power calculations under the matched-pair design. We begin with the
power calculation for UATE given a null hypothesis of $H_0: \psi_U = 0$, the
alternative hypothesis of $H_A: \psi_U = \psi$ and the level $\alpha$ t-test.
In this setting, Proposition 3 implies the power function,

\[
1 + T_{m-1}\!\left(-t_{m-1,\alpha/2}\,\Big|\,
      \frac{n\psi}{\sqrt{m\,\mathrm{Var}_p(\tilde w_k D_k)}}\right)
  - T_{m-1}\!\left(t_{m-1,\alpha/2}\,\Big|\,
      \frac{n\psi}{\sqrt{m\,\mathrm{Var}_p(\tilde w_k D_k)}}\right),
\]

where $T_{m-1}(\cdot \mid \zeta)$ is the distribution function of the
noncentral t distribution with $(m-1)$ degrees of freedom and the
noncentrality parameter $\zeta$, and $\tilde w_k = n_{1k} + n_{2k}$. For UATE,
we sample cluster pairs but not units within each cluster. Thus, a simpler
expression for the power function results if we assume equal cluster sizes. In
that case, a researcher may reparameterize the power function by normalizing
$\psi$ in terms of the standard deviation of within-pair mean differences,
that is, $d_U = \psi/\sqrt{\mathrm{Var}_p(D_k)}$. Then, we write the power
function as

\[
1 + T_{m-1}(-t_{m-1,\alpha/2} \mid d_U\sqrt{m})
  - T_{m-1}(t_{m-1,\alpha/2} \mid d_U\sqrt{m}).
\tag{12}
\]

Fig. 2. Relative efficiency of matched-pair over unmatched cluster randomized
designs in the SPS evaluation. (Horizontal axis: relative efficiency, UATE.)

Next, for PATE, we sample units within each cluster as well as pairs of
clusters. The null hypothesis is given by $H_0: \psi_P = 0$ and the
alternative is $H_A: \psi_P = \psi$. Again, for simplicity, we assume
$\bar n = n_{jk} = n/(2m)$ for all $j$ and $k$. Then, Proposition 3 implies
the power function is of the same form as equation (12) except that the
noncentrality parameter is given by

\[
\frac{n\psi}
     {\sqrt{m\left[E_p\{\tilde w_k^2\sum_{j=1}^{2}\mathrm{Var}_u(Y_{ijk})/\bar n\}
            + \mathrm{Var}_p\{\tilde w_k E_u(D_k)\}\right]}},
\]

where $\tilde w_k \propto N_{1k} + N_{2k}$. Similar to UATE, if population
cluster sizes are equal, we obtain a simpler power function

\[
1 + T_{m-1}\!\left(-t_{m-1,\alpha/2}\,\Big|\,
      \frac{d_P\sqrt{m}}{\sqrt{1+\pi/\bar n}}\right)
  - T_{m-1}\!\left(t_{m-1,\alpha/2}\,\Big|\,
      \frac{d_P\sqrt{m}}{\sqrt{1+\pi/\bar n}}\right),
\tag{13}
\]

where, as for UATE, $\psi$ is normalized by the standard deviation of the
within-pair mean difference, $d_P = \psi/\sqrt{\mathrm{Var}_p\{E_u(D_k)\}}$,
and $\pi$ is the ratio of the mean variances of the potential outcomes to the
variance of within-pair differences-in-means,
$\pi = \sum_{j=1}^{2}E_p\{\mathrm{Var}_u(Y_{ijk})\}/\mathrm{Var}_p(E_u(D_k))$.

5.3.2 Sample size calculations. We use the above results to estimate the
sample size required to achieve a given precision in a future experiment under
MPCR. Suppose an investigator wishes to specify the desired degree of
precision in terms of Type I and Type II error rates in hypothesis testing,
denoted by $\alpha$ and $\beta$, respectively. In particular, the goal is to
calculate the sample size required to achieve a given degree of power,
$1-\beta$, against a particular alternative (Snedecor and Cochran, 1989,
Section 6.14), using the power functions just derived. For example, for UATE
under equal cluster sizes, and using equation (12), the desired number of
cluster pairs is the smallest value of $m$ such that
$1 + T_{m-1}(-t_{m-1,\alpha/2} \mid d_U\sqrt{m}) - T_{m-1}(t_{m-1,\alpha/2}
\mid d_U\sqrt{m}) \ge 1-\beta$, where $d_U = \psi/\sqrt{\mathrm{Var}_p(D_k)}$,
$\alpha$, and $\beta$ are specified by the researcher. Similarly, for PATE,
equation (13) is used to determine the number of pairs and units within each
cluster.
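
The search for the smallest number of pairs under equation (12) can be sketched with the Python standard library alone. Since the standard library has no noncentral t routine, the sketch below evaluates that distribution function by Simpson integration over the chi-distributed scale, which is a numerical substitute chosen here for self-containment and not something taken from the text. The names `nct_cdf`, `t_critical`, `power_uate` and `min_pairs`, and the standardized effect of 0.5, are illustrative assumptions.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf

def nct_cdf(x, df, nc=0.0, steps=1000):
    """Noncentral t distribution function by Simpson's rule (steps must be even).

    Uses T = (Z + nc) / (S / sqrt(df)) with S chi-distributed on df degrees of freedom,
    so that P(T <= x) = E[Phi(x * S / sqrt(df) - nc)].
    """
    upper = math.sqrt(df) + 12.0                       # chi density is negligible beyond this
    h = upper / steps
    log_norm = (1.0 - df / 2.0) * math.log(2.0) - math.lgamma(df / 2.0)
    total = 0.0
    for i in range(steps + 1):
        s = i * h
        if s == 0.0:
            dens = math.exp(log_norm) if df == 1 else 0.0
        else:
            dens = math.exp(log_norm + (df - 1) * math.log(s) - 0.5 * s * s)
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * dens * _PHI(x * s / math.sqrt(df) - nc)
    return total * h / 3.0

def t_critical(df, alpha=0.05):
    """Two-sided critical value of the central t distribution, found by bisection."""
    lo, hi = 0.0, 1000.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if nct_cdf(mid, df) < 1.0 - alpha / 2.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def power_uate(d_u, m, alpha=0.05):
    """Power function of equation (12) for standardized effect d_u and m pairs."""
    df = m - 1
    t_crit = t_critical(df, alpha)
    nc = d_u * math.sqrt(m)
    return 1.0 + nct_cdf(-t_crit, df, nc) - nct_cdf(t_crit, df, nc)

def min_pairs(d_u, power=0.8, alpha=0.05, m_max=500):
    """Smallest m whose power reaches the target, or None if m_max is too small."""
    for m in range(2, m_max + 1):
        if power_uate(d_u, m, alpha) >= power:
            return m
    return None

pairs_needed = min_pairs(0.5, power=0.8, alpha=0.05)
print(pairs_needed)
```

For PATE under equal cluster sizes, the same functions could be reused by replacing the noncentrality `d_u * sqrt(m)` with the one in equation (13), at the cost of also specifying $\pi$ and $\bar n$.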

Empirical evidence. To illustrate, we use SPS evaluation data on the
annualized out-of-pocket health care expenditure that a household spent in the
most recent month. Using estimates of $\pi$ and
$\mathrm{Var}_p\{E_u(D_k)\}$ from the SPS data and equation (13), we calculate
the minimal absolute effect size for PATE that can be detected using a
two-sided t-test with size 0.95 (that is, at the 0.05 significance level) and
power 0.8, for any given cluster size and number of cluster pairs. Since the
household is the unit of interest in this example, our population count
involves the number of households per cluster, instead of the number of
individuals.

In the left panel of Figure 3, the horizontal axis is the number of pairs and
the vertical axis indicates the number of units within each cluster. The
contour lines represent the minimum detectable size in pesos. The graph shows
that MPCR with 30 pairs and 100 units within each cluster can detect the true
absolute effect size of approximately 450 pesos with the given precision. The
figure displays the obvious result that experiments with more pairs or more
units per cluster can detect smaller effects (contour lines are labeled with
smaller numbers as we move to the top right of the figure). More importantly,
the nearly vertical contour lines (above 50 or so units within each cluster)
indicate that adding more pairs of clusters adds more statistical power than
adding more units within each pair. However, adding one more pair means that
many more units will be added, and in some situations sampling units within
new clusters is more expensive than within existing clusters. As such, the
exact tradeoff depends on the specifics of each application, and it would be
incorrect to conclude that more clusters always dominates more units. (We
discuss the right panel of the figure next.)

5.3.3 Power comparison. Although MPCR is typ- 
ically more efficient than UMCR regardless of sam- 
ple size, Martin et al. [(1993), page 330] point out 
that when the number of pairs is small (fewer than 
about 10), "the matched design will probably have 
less power than the unmatched design" due to the 
loss of degrees of freedom. Here, we show that this 
conclusion critically hinges on Martin et al.'s as- 
sumption of equal cluster population sizes as well as 
their particular assumed parametric model relating 
the matching and outcome variables. Modeling as- 
sumptions are always worrisome, but the equal clus- 
ter size assumption is especially problematic because 
varying cluster sizes is in fact a fundamental feature 
of numerous CR experiments. 

Fig. 3. Sample size calculations for PATE under MPCR. The left panel
plots the smallest detectable absolute effect size of SPS on annualized
out-of-pocket expenditures (in pesos) using a 0.05 level two-sided test
with power 0.95, with $\pi$ estimated from SPS data. The horizontal and
vertical axes plot m and n, respectively. The right panel compares
correlations with and without population weights between treatment and
control group cluster-specific means in SPS data. All but one variable
has higher correlation when incorporating weights, as seen by a dot
below the 45° line. The graph also presents "break-even" correlations
(indicated by dashed and dotted lines with and without weights,
respectively), which are the smallest possible correlations matching
must induce in order for MPCR to detect a smaller effect size than
UMCR, given fixed power (0.8) and size (0.95). The graph suggests, when
weights are appropriately taken into consideration, that MPCR should be
preferred (for all but possibly one variable) even when the number of
pairs is as small as three.

When cluster sizes are unequal, the efficiency gain
of matching in CR trials depends on the correlations
of weighted cluster means between the treatment
and control clusters across pairs (with weights based
on sample or population cluster sizes depending on
the quantity of interest), not the unweighted
correlations used in Martin et al.'s calculations. Since
population cluster sizes are typically observed prior 
to the treatment randomization, researchers can in- 
corporate this variable into their matching proce- 
dure. As a result, correlations of weighted outcomes 
(constructed from clusters with matched weights) 
will usually be substantially higher than those of 
unweighted outcomes; this is true even when clus- 
ter sizes are independent of outcomes. Thus, in CR 
trials with unequal cluster sizes, the efficiency gain 
due to pre-randomization matching is likely to be 
considerably greater than the equal cluster size case 
considered by Martin et al. (1993). Any power com- 
parison must take this factor into consideration, and 
along with the bias reduction, this is another reason 
to incorporate cluster sizes into one's matching pro- 
cedure. 
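
The claim that matched weights raise the correlation even when cluster sizes are independent of outcomes can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the text: each pair's two clusters share exactly the same (hypothetical) size, cluster means are drawn independently of size and of each other, and a "weighted cluster mean" is taken to be the cluster mean multiplied by the pair's weight.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_pair_correlations(m=2000, seed=0):
    """Simulate m pairs whose two clusters share the same size.

    Cluster means are independent of size and of each other, so any
    correlation between the weighted means comes from the matched sizes.
    """
    rng = random.Random(seed)
    treated, control, weights = [], [], []
    for _ in range(m):
        weights.append(rng.uniform(100, 1000))   # hypothetical matched size
        treated.append(rng.gauss(10.0, 1.0))     # treated cluster mean
        control.append(rng.gauss(10.0, 1.0))     # control cluster mean
    unweighted = pearson(treated, control)
    weighted = pearson(
        [w * y for w, y in zip(weights, treated)],
        [w * y for w, y in zip(weights, control)],
    )
    return unweighted, weighted


unweighted_corr, weighted_corr = simulate_pair_correlations()
print(unweighted_corr, weighted_corr)
```

In this setup the unweighted correlation is close to zero by construction, while the weighted correlation is high because both members of a pair are scaled by the same weight. Real data will of course behave less cleanly.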

Empirical evidence. The right panel of Figure 3 
illustrates the argument above using the SPS eval- 
uation data, by calculating the across-pair correla- 
tions between treatment and control cluster means 
of 67 outcome variables (ranging from health re- 
lated variables to household health care expenditure 
variables), both with and without weights. We use 
population cluster sizes as weights, which were ob- 
served prior to the randomization of the treatment
and incorporated into the matching procedure used
(King et al., 2007). The graph shows that all but one 
variable has considerably higher correlations when 
weights are incorporated (which does not make the 
equal cluster size assumption) than when they are 
ignored (which assumes constant cluster size); this 
can be seen by all but one of the dots falling below 
the solid 45° line. In fact, the median of the corre- 
lations is more than three times larger with (0.68) 
than without (0.20) weights. In their conclusion, 
Martin et al. [(1993), page 336] recommend that if 
the number of pairs is 10 or fewer, then matching 
should be used only if researchers are confident that 
the correlation due to matching is at least 0.2. Indeed,
all variables in SPS meet this criterion if the
weights are appropriately taken into account, the
minimum correlation with weights being 0.22. (If
the correlations are calculated incorrectly without
weights, as they did, then only about half of the
variables meet their criterion.)

To illustrate the above result in terms of power 
and sample size calculations, the graph also presents 
the "break-even" matching correlations (indicated 
by dashed and dotted lines for correlations with and 
without weights, respectively) that are used by
Martin et al. (1993), Section 7. As in the original
article, we set the power and size of the test to be
0.8 and 0.95, respectively, and derive the smallest
correlation matching must induce in order for the 
matched-pair design to be able to detect smaller ef- 
fect sizes than the UMCR design. The result indi- 
cates that even with as few as three pairs, more than 
85% of the variables had a correlation higher than 
the break-even point, which is 0.56. With five pairs, 
all but one variable exceeds the threshold. 

In contrast, if one ignores the weights by incorrectly
assuming that the clusters are equally sized,
as in Martin et al., then only 4% and 34% of the
variables have correlations higher than the break-even
correlations for three and five pairs, respectively.
Martin et al. (1993) described the correlation of 0.25 
as "difficult to achieve by matching" (page 335). 
However, as the data from SPS evaluation show, 
since one can match on cluster sizes, the level of 
weighted correlations is much higher when cluster 
sizes are different. 

For another example, Donner and Klar [(2000b), 
page 37] give the unweighted correlations from seven 
different studies, only one of which is negative (0.49, 
0.41, 0.13, 0.63, -0.32, 0.94 and 0.21). The correct 
weighted correlations are not reported, but in all 
cases would be higher, and in all likelihood all seven 
would be positive. 

Thus, by dropping the assumption that all clusters 
are equally sized we have shown here that, for practi- 
cal purposes, the matched pair design may well have 
more statistical power than the UMCR design, even 
for small samples. Of course, if one has fewer than 
three matched pairs, it's probably time to stop wor- 
rying about the properties of statistical estimators 
and head to the field to gather more data. 

5.4 Lost Clusters, Stratified Designs and Causal 
Heterogeneity 

We now clarify four additional issues about MPCR
that have arisen. First, some recommend a stratified
design, where units are matched in blocks larger
than two. However, a stratified design is merely a
UMCR design operating within each stratum. If all
units within a stratum have identical values on
important background covariates, then it is effectively
equivalent to MPCR. But if any heterogeneity on
these covariates or cluster sizes remains within strata,
then the stratified design may leave some efficiency
on the table. Thus, when feasible, switching from a
stratified to an MPCR design has the potential to
greatly increase efficiency and power.



Second, Donner and Klar [(2000a), page 40] ex- 
plain that a "disadvantage of the matched-pair de- 
sign is that the loss to follow-up of a single clus- 
ter in a pair implies that both clusters in that pair 
must effectively be discarded from the trial, at least 
with respect to testing the effect of the interven- 
tion. This problem. . . clearly does not arise if there 
is some replication of clusters within each combina- 
tion of intervention and stratum." Indeed, the loss 
of a single cluster from a stratum with more than 
two clusters may make it possible to estimate the 
causal effect within that stratum, but the missing- 
ness process must be ascertained or assumed and 
some type of imputation strategy (or other proce- 
dure; e.g., Wei, 1982) must be used, risking model 
dependence. These are issues for both MPCR and 
stratified designs. Alternatively, if a cluster is lost in 
an MPCR study, then dropping the other member 
of the pair makes it possible to retain the benefits 
of randomization for SATE or CATE defined in the 
remaining pairs — without losing other observations, 
without imputation and possible model dependence, 
and regardless of the missing data mechanism (King 
et al., 2007). In contrast, the loss of a cluster in a 
UMCR design turns an experimental study into an 
observational study requiring the addition of ignora- 
bility assumptions which experimentalists normally 
try to avoid. The loss of a single cluster within a 
stratum larger than two units means that more than 
one cluster will need to be dropped in order to re- 
tain the benefits of randomization, which may lead 
to unnecessary efficiency losses. 
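
The pair-dropping strategy described above can be sketched in a few lines. The data layout (a dict per pair, with a lost cluster recorded as `None`) and the helper names are hypothetical, and weighting each pair by its total number of observed units is an assumption made for illustration.

```python
def drop_incomplete_pairs(pairs):
    """Keep only pairs in which both clusters were followed up.

    A lost cluster is recorded as None (a hypothetical encoding).
    """
    return [p for p in pairs
            if p["treated"] is not None and p["control"] is not None]


def mean(values):
    return sum(values) / len(values)


def weighted_pair_estimate(pairs):
    """Weighted average of within-pair differences in cluster means,
    weighting each pair by its total number of observed units."""
    total_weight = 0.0
    weighted_sum = 0.0
    for p in pairs:
        w = len(p["treated"]) + len(p["control"])
        weighted_sum += w * (mean(p["treated"]) - mean(p["control"]))
        total_weight += w
    return weighted_sum / total_weight


pairs = [
    {"treated": [3.0, 5.0], "control": [2.0, 4.0]},            # difference 1.0
    {"treated": [6.0, 8.0, 10.0], "control": None},            # control lost
    {"treated": [1.0, 3.0], "control": [0.0, 0.0, 0.0, 0.0]},  # difference 2.0
]
remaining = drop_incomplete_pairs(pairs)
estimate = weighted_pair_estimate(remaining)
print(len(remaining), estimate)
```

The estimate then refers to the causal effect defined on the remaining pairs only, which is the sense in which the benefits of randomization are retained.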

Third is the claim that MPCR restricts "predic- 
tion models to cluster-level baseline risk factors (for 
example, cluster size)" (Donner and Klar, 2004). 
This sentence has been widely interpreted to mean 
that prediction models under MPCR cannot include 
baseline risk factors, but Donner and Klar clearly in- 
tended it to indicate (and confirmed to us that they 
meant) the more straightforward point that cluster- 
level fixed effects cannot be included in regression 
models under MPCR. Of course, results can be an- 
alyzed within strata defined by any individual or 
cluster level variable, so long as it is pre-treatment. 
For example, the bottom two rows of Table 3 repeat 
the same analysis as the top two rows but only for 
male-headed households, a variable measured only 
at the unit-level and used to separate the sample at 
that level. (The results for each quantity of interest 
in this case appear only slightly larger than for the 
entire sample.) Regression models with fixed effects
for clusters are unidentified under MPCR, although
substituting in random effects is unproblematic, at
least for identification purposes.

Finally, Donner and Klar (2004) explain that MPCR
is to be faulted because of its "inability to test for
homogeneity" of causal effects within a pair, and
hypothesis tests cannot be conducted for the
difference between two pairs. However, the causal effect
is easy to measure at the pair level without bias or
model dependence under MPCR (but not under UMCR)
merely by taking the difference in
means between the two clusters. This may be a noisy
estimate if matching quality is poor, but it serves as 
a useful unbiased dependent variable for subsequent 
analyses. We can see how it varies as a function of 
any variable measured at the unit level and then 
aggregated to the cluster-pair level, or measured di- 
rectly at the aggregate level from existing data, such 
as from census data. Even hypothesis tests are pos- 
sible if we pool pairs. For example, since the point 
of SPS was to help poor families, we could exam- 
ine whether the causal effect of rolling out SPS on 
various outcome variables increases as the wealth of 
an area drops. This can be done by a simple plot 
of the pair-level causal effect by wealth, or fitting a 
regression model. 
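
A minimal sketch of the regression mentioned above, using only the standard library. The wealth index and pair-level effects below are made-up illustrative numbers, not SPS results, and the single-regressor least squares fit is just one simple way to summarize how the pair-level effect varies with a pair-level covariate.

```python
def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical pair-level data: a wealth index aggregated to the pair level
# and the within-pair difference in cluster means (treated minus control).
pair_wealth = [1.0, 2.0, 3.0, 4.0, 5.0]
pair_effect = [-0.050, -0.040, -0.030, -0.020, -0.010]

slope, intercept = ols_slope_intercept(pair_wealth, pair_effect)
print(slope, intercept)
```

A positive slope here would mean that the (negative) effect shrinks toward zero as wealth rises, i.e., that the effect is larger in poorer areas.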

6. METHODS FOR UNIT-LEVEL 
NONCOMPLIANCE 

CR trials typically have imperfect treatment compliance
at the unit level. Some individuals in treatment
clusters refuse treatment while others in control
clusters receive the treatment. Since most CR
social experiments, including the SPS evaluation,
allow noncompliance, analysts may wish, in addition to
reporting ITT estimates, to account for noncompliance
and estimate the effect of the program only for
individuals who would adhere to the experimental
protocol. Thus, we now extend our approach to CR trials
under the MPCR encouragement design, where the
encouragement to receive a treatment, rather than
the receipt of the treatment itself, is randomized at
the cluster level.

Angrist, Imbens and Rubin (1996) show how an
instrumental variable method can be used to analyze
experiments with noncompliance under individually
randomized designs. We extend their approach to MPCR
experiments with unit-level noncompliance. To
complement the parametric Bayesian approach to this
problem (under the unmatched cluster randomized design)
by Frangakis, Rubin and Zhou (2002), we consider a
design-based analysis applying the approach introduced
in Section 4.

6.1 Causal Quantities of Interest 

We consider two types of causal quantities of
interest under MPCR encouragement designs: the
intention-to-treat (ITT) effect and the complier
average causal effect (CACE) (Angrist, Imbens and
Rubin, 1996). The ITT effect is the average causal
effect of encouragement (rather than treatment) and
is equivalent to the various versions of the average
treatment effect in Section 3.3 (i.e., SATE, CATE,
UATE and PATE, with or without interference).

In contrast, the CACE estimand is the average
treatment effect (for SATE, CATE, UATE or PATE,
with or without interference) among compliers only.
Compliers are neither those merely observed to
affiliate in encouraged clusters nor those observed
not to affiliate in clusters not encouraged, since
the former group includes always-takers and the
latter includes never-takers. Note that always-takers
(never-takers) are those who always (never) take the
treatment regardless of whether or not they are
encouraged. In addition, these observed groups are
defined as a consequence of the treatment. Compliers
are those who would affiliate only if they were
encouraged and would not affiliate only if they were
not encouraged, and so this group is defined prior to
the encouragement but its members are not completely
observed. We propose a method that can be used to
estimate CACE.

6.2 Design and Notation 

The setup is the same as Section 3.2 except that
$T_{jk}$ represents whether the units in the $j$th
cluster in the $k$th pair are encouraged to receive the
treatment rather than whether it received the
treatment. Recall that $T_{1k} = Z_k$ and $T_{2k} = 1 - Z_k$.
Now, let $R_{ijk}(T_{jk})$ be the potential treatment
receipt indicator variables for the $i$th unit in the
$j$th cluster of the $k$th pair under the encouragement
($T_{jk} = 1$) and control ($T_{jk} = 0$) conditions. The
observed treatment variable is, then,
$R_{ijk} = T_{jk}R_{ijk}(1) + (1 - T_{jk})R_{ijk}(0)$.
Similar to the potential outcomes, these potential
treatment variables depend on the cluster-level
encouragement variable rather than the unit-level
encouragement variable, requiring a different
interpretation of the resulting causal effects. Finally,
we write the potential outcomes as functions of
(cluster-level) randomized encouragement and actual
receipt of treatment (at the unit-level), that is,
$Y_{ijk}(R_{ijk}, T_{jk})$. This formulation makes the
following assumption, which is an extension of
Assumption 1:

Assumption 3 (No interference between units).
Let $R_{ijk}(\mathbf{T})$ be the potential treatment receipt
for the $i$th unit in the $j$th cluster of the $k$th
matched-pair, where $\mathbf{T}$ is an ($m \times 2$) matrix
whose ($j,k$) element is $T_{jk}$. Furthermore, let
$Y_{ijk}(\mathbf{R}, \mathbf{T})$ be the potential outcomes
for the $i$th unit in the $j$th cluster of the $k$th
matched-pair, where $\mathbf{R}$ is an
($n_{jk} \times m \times 2$) ragged array whose ($i,j,k$)
element is $R_{ijk}$. Then:

1. If $T_{jk} = T'_{jk}$, then $R_{ijk}(\mathbf{T}) = R_{ijk}(\mathbf{T}')$.

2. If $T_{jk} = T'_{jk}$ and $R_{ijk} = R'_{ijk}$, then
$Y_{ijk}(\mathbf{R}, \mathbf{T}) = Y_{ijk}(\mathbf{R}', \mathbf{T}')$.

In other words, this assumption requires that one 
person's decision to affiliate has no effect on any 
other person's outcomes within the same cluster; as 
such, the requirements are more demanding than for 
the ITT effects above. This assumption might be vi- 
olated for certain health outcomes in the SPS evalu- 
ation: if all of one's neighbors affiliate with SPS, the 
health care they receive may reduce the prevalence 
of infectious diseases and so might thereby improve 
that person's health outcomes (an example of "herd
immunity"). Relaxing this assumption thus remains
an important methodological issue that seems
worthy of future research.

The no interference assumption allows us to write
$R_{ijk}(\mathbf{T}) = R_{ijk}(T_{jk})$ and
$Y_{ijk}(\mathbf{R}, \mathbf{T}) = Y_{ijk}(R_{ijk}, T_{jk})$.
Since $T_{1k} = Z_k$ and $T_{2k} = 1 - Z_k$, both
$R_{ijk}(T_{jk})$ and $Y_{ijk}(T_{jk})$ depend on $Z_k$ alone.

Extending the framework of Angrist, Imbens and 
Rubin (1996) to CR trials, we make an exclusion re- 
striction so that cluster-level encouragement affects 
the unit-level outcome only through the unit-level 
receipt of the treatment: 

Assumption 4 (Exclusion restriction).
$Y_{ijk}(r, 0) = Y_{ijk}(r, 1)$ for $r = 0, 1$ and all $i$, $j$, and $k$.

These assumptions together simplify the problem
by enabling us to write the potential outcomes as
functions of $T_{jk}$ (or $Z_k$) alone, that is,
$Y_{ijk}(R_{ijk}, T_{jk}) = Y_{ijk}(T_{jk})$.

Finally, following Angrist, Imbens and Rubin
(1996), we call the units with $R_{ijk}(T_{jk}) = T_{jk}$
compliers (and denote them by $C_{ijk} = c$), those with
$R_{ijk}(T_{jk}) = 1$ always-takers ($C_{ijk} = a$), those with
$R_{ijk}(T_{jk}) = 0$ never-takers ($C_{ijk} = n$) and the units
with $R_{ijk}(T_{jk}) = 1 - T_{jk}$ defiers ($C_{ijk} = d$). The
monotonicity assumption excludes the existence of defiers.

Assumption 5 (Monotonicity). There exists no
defier. That is, $R_{ijk}(1) \geq R_{ijk}(0)$ holds for all
$i$, $j$ and $k$.
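
The four compliance types and the monotonicity condition can be written out directly from the definitions above. The function names and the toy list of units are hypothetical. Note that in real data only one of the two potential receipt values is observed for each unit, so this classification is a conceptual device rather than something computable from observed data.

```python
def compliance_type(r0, r1):
    """Classify a unit by its potential treatment receipt.

    r0 is receipt when the unit's cluster is not encouraged,
    r1 is receipt when the unit's cluster is encouraged.
    """
    types = {
        (0, 1): "complier",
        (1, 1): "always-taker",
        (0, 0): "never-taker",
        (1, 0): "defier",
    }
    return types[(r0, r1)]


def satisfies_monotonicity(units):
    """True if no unit is a defier, i.e. R(1) >= R(0) for every unit."""
    return all(r1 >= r0 for r0, r1 in units)


def observed_receipt(t, r0, r1):
    """Observed receipt R = T * R(1) + (1 - T) * R(0)."""
    return t * r1 + (1 - t) * r0


units = [(0, 1), (1, 1), (0, 0), (0, 1)]   # hypothetical (R(0), R(1)) values
labels = [compliance_type(r0, r1) for r0, r1 in units]
print(labels, satisfies_monotonicity(units))
```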

In our Mexico evaluation, never-takers are those 
who would not affiliate with SPS regardless of whether 
the government encourages them to do so or not. 
Since SPS was designed for the poor, many wealthy 
citizens with their own preexisting health care ar- 
rangements may be never-takers. We expected a sub- 
stantial proportion of the population to qualify as 
never-takers, and in fact estimate them at 56%. 
Always-takers are those who would affiliate with 
SPS regardless of assignment. These are more un- 
common, and would likely be the poor without ac- 
cess to health care who nevertheless have the infor- 
mation and financial resources necessary to travel to 
the place to sign up for SPS and to travel back to 
receive care. (The estimated proportion of always- 
takers is only 7%.) The last type is defiers, or people 
who would affiliate with SPS if not encouraged to do 
so but would not affiliate if encouraged. Assuming 
the absence of defiers seems reasonable. 

6.3 Estimation 

If we assume sampling of both pairs of clusters
and units within each cluster, then the ITT causal
effect can be defined as $\psi_P$. Thus,
$\hat\psi(N_{1k} + N_{2k})$ can be used to estimate this
ITT effect, and the approximately unbiased estimation
of its variance is possible using the results given in
Section 4.3.

Next, we consider population CACE. Under the
assumption of simple random sampling of both clusters
and units within each cluster, this estimand is
defined as
$\gamma \equiv E_{\mathcal{P}}(Y(1) - Y(0) \mid C = c) = E_{\mathcal{P}}(Y(1) - Y(0))/E_{\mathcal{P}}(R(1) - R(0))$,
where the equality follows from the direct application
of Angrist, Imbens and Rubin (1996) to CR trials under
the assumptions stated above. If we only assume simple
random sampling of clusters as in UATE, then the
expectation in $\gamma$ is taken with respect to the set
$\mathcal{U}$ rather than $\mathcal{P}$.

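Thus, the instrumental variable estimator based
on the general weighted estimator in equation (3) is
$\hat\gamma(w_k) \equiv \hat\psi(w_k)/\hat\tau(w_k)$, where
$\hat\tau(w_k)$ is the estimator of the ITT effect on the
receipt of the treatment

$$
\hat\tau(w_k) \equiv \frac{1}{\sum_{k=1}^m w_k} \sum_{k=1}^m w_k
\left\{ Z_k \left( \frac{\sum_{i=1}^{n_{1k}} R_{i1k}}{n_{1k}} - \frac{\sum_{i=1}^{n_{2k}} R_{i2k}}{n_{2k}} \right)
+ (1 - Z_k) \left( \frac{\sum_{i=1}^{n_{2k}} R_{i2k}}{n_{2k}} - \frac{\sum_{i=1}^{n_{1k}} R_{i1k}}{n_{1k}} \right) \right\}.
$$

When matching is effective or when cluster sizes
are equal within each matched-pair, this estimator
is consistent and approximately unbiased. Using a
Taylor series expansion, the variance of this
estimator can be approximated by

$$
\mathrm{Var}_{apu}(\hat\gamma(w_k)) \approx \frac{1}{\{E_{apu}(\hat\tau(w_k))\}^4}
\Bigl[ \{E_{apu}(\hat\tau(w_k))\}^2 \, \mathrm{Var}_{apu}(\hat\psi(w_k))
+ \{E_{apu}(\hat\psi(w_k))\}^2 \, \mathrm{Var}_{apu}(\hat\tau(w_k))
- 2 E_{apu}(\hat\psi(w_k)) E_{apu}(\hat\tau(w_k)) \cdot \mathrm{Cov}_{apu}(\hat\psi(w_k), \hat\tau(w_k)) \Bigr],
\tag{14}
$$

where if simple random sampling of pairs of clusters
alone is assumed, then the subscript "apu" (for
assignment, pairs, and units) is replaced with "ap."
Furthermore, the argument given in Section 4.3
implies, for example, that the variance of
$\hat\gamma(w_k)$ for estimating the sample CACE is on
average less than the variance for the population CACE
given in equation (14).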
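
The ratio form of the estimator and the Taylor-series variance approximation can be sketched as follows. The data layout, the function names, and the choice of weights (each pair's total number of sampled units) are assumptions made for illustration. The variance and covariance inputs to `ratio_variance` are taken as given here rather than estimated.

```python
def weighted_itt(pairs, key):
    """Weighted average of within-pair (encouraged minus control)
    differences in cluster means of pair[key].

    The weight is the pair's total number of units (an assumption).
    """
    num = den = 0.0
    for p in pairs:
        enc, ctl = p[key]["encouraged"], p[key]["control"]
        w = len(enc) + len(ctl)
        num += w * (sum(enc) / len(enc) - sum(ctl) / len(ctl))
        den += w
    return num / den


def cace_estimate(pairs):
    """Ratio of the ITT effect on the outcome to the ITT effect on receipt."""
    return weighted_itt(pairs, "y") / weighted_itt(pairs, "r")


def ratio_variance(psi, tau, var_psi, var_tau, cov_psi_tau):
    """First-order Taylor (delta-method) variance of psi / tau."""
    return (tau ** 2 * var_psi + psi ** 2 * var_tau
            - 2 * psi * tau * cov_psi_tau) / tau ** 4


# Two hypothetical pairs: "y" holds outcomes, "r" holds receipt indicators.
pairs = [
    {"y": {"encouraged": [2.0, 4.0], "control": [1.0, 1.0]},
     "r": {"encouraged": [1, 1], "control": [0, 1]}},
    {"y": {"encouraged": [3.0, 3.0], "control": [2.0, 2.0]},
     "r": {"encouraged": [1, 0], "control": [0, 0]}},
]
gamma_hat = cace_estimate(pairs)
approx_var = ratio_variance(1.0, 0.5, 0.04, 0.01, 0.0)
print(gamma_hat, approx_var)
```

In the toy data the ITT effect on the outcome is 1.5 and the ITT effect on receipt is 0.5, so the ratio is 3.0.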

Finally, Proposition 3 shows how to estimate
$\mathrm{Var}_{apu}(\hat\psi(w_k))$, $\mathrm{Var}_{apu}(\hat\tau(w_k))$
(or $\mathrm{Var}_{ap}(\hat\psi(w_k))$ and
$\mathrm{Var}_{ap}(\hat\tau(w_k))$) approximately without bias.
Thus, we only need an estimate of the covariance between
$\hat\psi(w_k)$ and $\hat\tau(w_k)$ from the observed data.
Using the normalized weights $\tilde w_k$, Appendix A.5
proves that the following estimator is approximately
unbiased for both
$\mathrm{Cov}_{apu}(\hat\psi(\tilde w_k), \hat\tau(\tilde w_k))$ and
$\mathrm{Cov}_{ap}(\hat\psi(\tilde w_k), \hat\tau(\tilde w_k))$
under their respective sampling assumptions:

7. SEGURO POPULAR EVALUATION

We now estimate the causal effect of SPS on the
probability of a household suffering catastrophic health
expenditures (out-of-pocket health care expenditures
totaling more than 30% of a household's annual
post-subsistence or disposable income). As nearly 10% of
households suffer catastrophic health expenditures
in a year, it is easy to see why this would be a
major priority. We estimate all four target population
quantities of interest (SATE, CATE, UATE,
and PATE) both for the intention-to-treat (ITT)
effect of encouragement to affiliate and the average
causal effect among compliers (CACE). Although in
most applications, substantive interest would narrow
this list to one or a few of these quantities, for
our methodological purposes we present all eight
estimates (and standard errors) in Table 3.

A table like this will always have some of the same
features, no matter what variable is analyzed. Recall,
for example, that point estimates of SATE and
UATE are the same, as they are for CATE and
PATE. In addition, standard errors of UATE and
PATE are the upper bounds of the standard errors
for SATE and CATE, respectively. CACE estimates
of course are never smaller in magnitude than those
for ITT.

For the specific estimates, consider first the two
top lines of Table 3 corresponding to all households.
For these data, the CACE estimates are about 2.7
times larger than those for ITT. The large difference
is because of all those who had preexisting health
care and so were largely never-takers. Overall, these
results indicate that SPS was clearly successful in
reducing the most devastating type of medical
expenditures. The differences among the columns
indicate that the average causal effect of encouragement
to affiliate to SPS (the ITT effect) is somewhat
larger in the population of individuals represented
by our sample (-0.023) than among the individuals
we directly observe (-0.014). The same is also true
among compliers, but at a higher level (-0.038 vs.
-0.064).
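
The ratio quoted above can be checked with simple arithmetic on the Table 3 point estimates for all households. The second calculation rests on an assumption: under the setup of Section 6.2, the ratio of the ITT effect to the CACE is the implied share of compliers, which can be compared with the share left over after the never-taker (56%) and always-taker (7%) estimates reported in Section 6.2. Because those proportions may have been computed with different weights, only rough agreement should be expected.

```python
# Point estimates for all households, taken from Table 3.
itt_sate, cace_sate = -0.014, -0.038
itt_pate, cace_pate = -0.023, -0.064

ratio_sate = cace_sate / itt_sate   # how much larger CACE is than ITT
ratio_pate = cace_pate / itt_pate

# Implied share of compliers (assumption: ITT / CACE under Section 6.2).
implied_compliers_sate = itt_sate / cace_sate

# Share remaining after the never-taker and always-taker estimates.
compliers_from_types = 1.0 - 0.56 - 0.07

print(round(ratio_sate, 2), round(ratio_pate, 2))
print(round(implied_compliers_sate, 3), round(compliers_from_types, 2))
```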

Substantively, these numbers are quite large. Since
those who suffer from catastrophic health expenditures
are mostly the poor without access to health
insurance, they are likely to be disproportionately
represented among compliers as compared to the
wealthy with preexisting health care arrangements.
As such, this analysis indicates that the causal
effect of rolling out the policy reduces by about 23%
the proportion of those who experience catastrophic
expenditures (i.e., -0.023 of the 10% with
catastrophic expenditures). (Detailed analyses of these
and other substantive results from the SPS evaluation
appear in King et al., 2009.)

Table 3
Estimates of eight causal effects of SPS on the probability of
catastrophic health expenditures for all households and
male-headed households (standard errors in parentheses)

                     SATE              CATE              UATE            PATE
All          ITT     -0.014 (< 0.007)  -0.023 (< 0.015)  -0.014 (0.007)  -0.023 (0.015)
             CACE    -0.038 (< 0.018)  -0.064 (< 0.024)  -0.038 (0.018)  -0.064 (0.024)
Male-headed  ITT     -0.016 (< 0.008)  -0.025 (< 0.018)  -0.016 (0.008)  -0.025 (0.018)
             CACE    -0.042 (< 0.020)  -0.070 (< 0.031)  -0.042 (0.020)  -0.070 (0.031)

8. CONCLUDING REMARKS 

The methods developed here are designed for
researchers lucky enough to be able to randomize
treatment assignment, but stuck because of political or
other constraints with having to randomize clusters
of individuals rather than the individuals themselves.
Field experiments in particular frequently require
cluster randomization. Individual-level randomization
was impossible in our evaluation of the Mexican SPS
program; in fact, negotiations with the
Mexican government began with the presumption
that no type of randomization would be politically
feasible, but they eventually concluded by allowing
cluster-level randomization to be implemented.

When clusters of individuals are randomized rather 
than the individuals themselves, the best practice 
should involve three steps. First, researchers should 
choose their causal quantity of interest, as defined 
in Section 3.3. They should then identify available 
pre-treatment covariates likely to affect the outcome 
variable, and, if possible, pair clusters based on the 
similarity of these covariates and cluster sizes; this 
step is severely underutilized and, when feasible, 
will translate into considerable research resources 
saved and numerous observations gained. Finally, re- 
searchers should randomly choose one treated and 
one control cluster within each pair. Claims in the 
literature about problems with matched-pair clus- 
ter randomization designs are misguided: clusters 
should be paired prior to randomization when con- 
sidered from the perspective of efficiency, power, 
bias or robustness. 



Of course, administrative, political, ethical and 
other issues will sometimes constrain the ability of 
researchers to pair clusters prior to randomization. 
With the results and new estimators offered here, 
the effort in the design of cluster-randomized exper- 
iments can now shift from debates about when pair- 
ing is useful to practical discussions of how best to 
marshal creative arguments and procedures to en- 
sure that clusters can more often be paired prior to 
randomization. 

Cornfield [(1978), pages 101-102] concludes his 
now classic study by writing that "Randomization 
by cluster accompanied by an analysis appropriate 
to randomization by individual is an exercise in self- 
deception, . . . and should be avoided," and an enor- 
mous literature has grown in many fields echoing 
this warning. We can now add that randomization 
by cluster without prior construction of matched 
pairs, when pairing is feasible, is an exercise in self- 
destruction. Failing to match can greatly reduce ef- 
ficiency, power and robustness, and is equivalent to 
discarding a large portion of experimental data or 
wasting grant money and investigator effort. This re- 
sult should affect practice, especially in literatures 
like political science where experimental analyses 
routinely use cluster-randomization but examples of 
matched-pair designs have almost never been used, 
as well as community consensus recommendations 
for best practices in the conduct and analysis of 
cluster-randomized experiments, which closely fol- 
low prior methodological literature. These include 
the extension to the "CONSORT" agreement among 
the major biomedical journals (Campbell, Elbourne 
and Altman, 2004), the Cochrane Collaboration re- 
quirements for reviewing research (Higgins and Green, 
2006, Section 8.11.2), the prominent Medical Re- 
search Council (2002) guidelines, and the education 
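research What Works Clearinghouse (2006). Each
would seem to require crucial modifications in light
of the results given here.

APPENDIX A: MATHEMATICAL APPENDIX

A.1 Proof of Proposition 1

This proof uses a strategy similar to that of
Proposition 1 of Imai (2008). First, rewrite
$\hat\sigma(\tilde w_k)$ as

$$
\begin{aligned}
\hat\sigma(\tilde w_k)
&= \frac{m}{(m-1)n^2} \sum_{k=1}^m \Biggl[ \tilde w_k \{Z_k D_k(1) + (1 - Z_k) D_k(0)\}
 - \frac{1}{m} \sum_{k'=1}^m \tilde w_{k'} \{Z_{k'} D_{k'}(1) + (1 - Z_{k'}) D_{k'}(0)\} \Biggr]^2 \\
&= \frac{m}{(m-1)n^2} \Biggl[ \frac{m-1}{m} \sum_{k=1}^m \tilde w_k^2 \{Z_k D_k(1)^2 + (1 - Z_k) D_k(0)^2\} \\
&\qquad - \frac{1}{m} \sum_{k=1}^m \sum_{k' \neq k} \tilde w_k \tilde w_{k'}
 \{Z_k Z_{k'} D_k(1) D_{k'}(1) + Z_k (1 - Z_{k'}) D_k(1) D_{k'}(0) \\
&\qquad\qquad + (1 - Z_k) Z_{k'} D_k(0) D_{k'}(1) + (1 - Z_k)(1 - Z_{k'}) D_k(0) D_{k'}(0)\} \Biggr].
\end{aligned}
$$

Assumption 2 implies $E_a(Z_k) = 1/2$ and $E_a(Z_k Z_{k'}) = 1/4$
for $k \neq k'$. Thus, taking expectations over $Z_k$
and rearranging gives

$$
E_a(\hat\sigma(\tilde w_k)) = \frac{m}{(m-1)n^2} \Biggl[ \frac{m-1}{2m} \sum_{k=1}^m \tilde w_k^2 \{D_k(1)^2 + D_k(0)^2\}
 - \frac{1}{4m} \sum_{k=1}^m \sum_{k' \neq k} \tilde w_k \tilde w_{k'} (D_k(1) + D_k(0)) (D_{k'}(1) + D_{k'}(0)) \Biggr].
\tag{15}
$$

Finally, we compare this with the true variance
expression in (7):
$E_a(\hat\sigma(\tilde w_k)) - \mathrm{Var}_a(\hat\psi(\tilde w_k))$,
which equals

$$
\begin{aligned}
&\frac{m}{4(m-1)n^2} \Biggl[ \frac{m-1}{m} \sum_{k=1}^m \tilde w_k^2 \{D_k(1) + D_k(0)\}^2
 - \frac{1}{m} \sum_{k=1}^m \sum_{k' \neq k} \tilde w_k \tilde w_{k'} (D_k(1) + D_k(0)) (D_{k'}(1) + D_{k'}(0)) \Biggr] \\
&\quad = \frac{m}{4(m-1)n^2} \sum_{k=1}^m \Biggl[ \tilde w_k (D_k(1) + D_k(0))
 - \frac{1}{m} \sum_{k'=1}^m \tilde w_{k'} (D_{k'}(1) + D_{k'}(0)) \Biggr]^2.
\end{aligned}
$$

This bias term is not identifiable because $D_k(1)$ and
$D_k(0)$ are not jointly observed for any $k$, implying
that the variance is not identifiable either.

A.2 Proof of Proposition 2

Applying the law of iterated expectations to
equation (15), we have

$$
E_{au}(\hat\sigma(\tilde w_k)) = \frac{m}{(m-1)n^2} \Biggl[ \frac{m-1}{2m} \sum_{k=1}^m \tilde w_k^2 E_u\{D_k(1)^2 + D_k(0)^2\}
 - \frac{1}{4m} \sum_{k=1}^m \sum_{k' \neq k} \tilde w_k \tilde w_{k'} E_u(D_k(1) + D_k(0)) \, E_u(D_{k'}(1) + D_{k'}(0)) \Biggr],
\tag{16}
$$

where the equality follows from the assumption that
sampling of units is independent across clusters.
Together with the definition of
$\mathrm{Var}_{au}(\hat\psi(\tilde w_k))$ given above, we have

$$
\begin{aligned}
E_{au}(\hat\sigma(\tilde w_k)) - \mathrm{Var}_{au}(\hat\psi(\tilde w_k))
&= \frac{m}{4(m-1)n^2} \Biggl[ \frac{m-1}{m} \sum_{k=1}^m \tilde w_k^2 \{E_u(D_k(1)^2 + D_k(0)^2)
 - \mathrm{Var}_u(D_k(1)) - \mathrm{Var}_u(D_k(0)) + 2 E_u(D_k(1)) E_u(D_k(0))\} \\
&\qquad - \frac{1}{m} \sum_{k=1}^m \sum_{k' \neq k} \tilde w_k \tilde w_{k'} E_u(D_k(1) + D_k(0)) \, E_u(D_{k'}(1) + D_{k'}(0)) \Biggr] \\
&= \frac{m}{4(m-1)n^2} \sum_{k=1}^m \Biggl[ \tilde w_k E_u(D_k(1) + D_k(0))
 - \frac{1}{m} \sum_{k'=1}^m \tilde w_{k'} E_u(D_{k'}(1) + D_{k'}(0)) \Biggr]^2.
\end{aligned}
$$

Since we do not observe $D_k(1)$ and $D_k(0)$ jointly,
this variance is not identified.

A.3 Proof of Proposition 3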

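Since UATE is a special case of PATE where all
units within each cluster are observed ($n_{jk} = N_{jk}$),
we first derive the variance of $\hat\psi(\tilde w_k)$ for
PATE. Let $\tilde D_k(t) \equiv \tilde w_k D_k(t)$,
$\tilde\mu_k(t) \equiv E_u(\tilde D_k(t))$, and
$\tilde\eta_k(t) \equiv \mathrm{Var}_u(\tilde D_k(t))$ for
$t = 0, 1$. Then, randomizing the order of clusters within
each pair implies
$E_p(\tilde\eta_k) \equiv E_p(\tilde\eta_k(1)) = E_p(\tilde\eta_k(0))$ and
$\mathrm{Var}_p(\tilde\mu_k) \equiv \mathrm{Var}_p(\tilde\mu_k(1)) = \mathrm{Var}_p(\tilde\mu_k(0))$.
Then, the variance is

$$
\begin{aligned}
\mathrm{Var}_{apu}(\hat\psi(\tilde w_k))
&= \frac{1}{m \bar n^2} \Biggl[ E_p \Biggl\{ \frac{1}{2} \bigl( \mathrm{Var}_u(\tilde D_k(1)) + \mathrm{Var}_u(\tilde D_k(0)) \bigr)
 + \frac{1}{4} \{E_u(\tilde D_k(1) - \tilde D_k(0))\}^2 \Biggr\}
 + \frac{1}{4} \mathrm{Var}_p\bigl( E_u(\tilde D_k(1) + \tilde D_k(0)) \bigr) \Biggr] \\
&= \frac{1}{m \bar n^2} \{E_p(\tilde\eta_k) + \mathrm{Var}_p(\tilde\mu_k)\}.
\end{aligned}
$$

When the estimand is UATE, $\tilde\eta_k = 0$ for all $k$
since within-cluster means are observed without sampling
variability. Thus,
$\mathrm{Var}_{apu}(\hat\psi(\tilde w_k)) = \mathrm{Var}_p(\tilde\mu_k)/(m \bar n^2)$.

Next, we show that $\hat\sigma(\tilde w_k)$ is approximately
unbiased by applying the law of iterated expectations to
equation (16):

$$
\begin{aligned}
E_{apu}(\hat\sigma(\tilde w_k))
&= \frac{1}{2 m \bar n^2} \Bigl[ E_p\{\tilde w_k^2 E_u(D_k(1)^2 + D_k(0)^2)\}
 - \frac{1}{2} \bigl[ E_p\{\tilde w_k E_u(D_k(1) + D_k(0))\} \bigr]^2 \Bigr] \\
&= \frac{1}{m \bar n^2} \bigl[ E_p\{\mathrm{Var}_u(\tilde D_k)\} + \mathrm{Var}_p(\tilde\mu_k) \bigr],
\end{aligned}
$$

where $E_u(\tilde D_k^2) \equiv E_u(\tilde D_k(0)^2) = E_u(\tilde D_k(1)^2)$
holds because the order of clusters within each pair is
randomized. For UATE, $\mathrm{Var}_u(\tilde D_k) = 0$ for all
$k$ since within-cluster means are observed without sampling
uncertainty. Thus,
$E_{ap}(\hat\sigma(\tilde w_k)) = \mathrm{Var}_p(\tilde\mu_k)/(m \bar n^2)$.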
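
The assignment-level expectation that underlies these proofs (equation (15) and the bias term of Section A.1) can be checked by brute force for a small number of pairs. The sketch below assumes the variance estimator has the form $m/\{(m-1)n^2\}$ times the sum of squared deviations of the weighted within-pair differences from their mean, with weights that sum to $n$. The potential within-pair differences and the weights are hypothetical numbers. Both potential differences are supplied for every pair, which is possible only in a simulation.

```python
from itertools import product


def check_proposition_1(d1, d0, w):
    """Compare E_a(sigma_hat) - Var_a(psi_hat) with the closed-form bias
    term by enumerating all 2^m equally likely assignments."""
    m, n = len(w), sum(w)            # weights are taken to sum to n
    psis, sigmas = [], []
    for z in product([0, 1], repeat=m):
        # Weighted observed within-pair difference for each pair.
        x = [w[k] * (d1[k] if z[k] else d0[k]) for k in range(m)]
        xbar = sum(x) / m
        psis.append(sum(x) / n)
        sigmas.append(m / ((m - 1) * n ** 2)
                      * sum((xk - xbar) ** 2 for xk in x))
    e_psi = sum(psis) / len(psis)
    var_psi = sum((p - e_psi) ** 2 for p in psis) / len(psis)
    e_sigma = sum(sigmas) / len(sigmas)
    # Closed-form bias term from the end of Section A.1.
    a = [w[k] * (d1[k] + d0[k]) for k in range(m)]
    abar = sum(a) / m
    closed_form = (m / (4 * (m - 1) * n ** 2)
                   * sum((ak - abar) ** 2 for ak in a))
    return e_sigma - var_psi, closed_form


# Hypothetical potential within-pair differences and weights, m = 4 pairs.
bias, closed_form = check_proposition_1(
    d1=[1.0, 2.5, -0.5, 3.0],
    d0=[-1.5, -2.0, 1.0, -2.5],
    w=[10.0, 20.0, 15.0, 5.0],
)
print(bias, closed_form)
```

The two returned numbers should agree up to floating-point error, and the bias is nonnegative, which is the sense in which the variance estimator is conservative.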

A. 4 Properties of the Harmonic Mean Estimator 
and Standard Error 

Modeling assumptions. The harmonic mean estimator,
with weights based on the harmonic mean of sample
cluster sizes $w_k = n_{1k} n_{2k}/(n_{1k} + n_{2k})$, stems
from the weighted one-sample t-test for the difference
in means:
$D_k \overset{\text{indep.}}{\sim} \mathcal{N}(\mu, (w_k/\sum_{k'=1}^m w_{k'})^{-1} \sigma)$
for $k = 1, 2, \ldots, m$, where $w_k$ is the known harmonic
mean weight. In our context, $D_k$ is the observed
within-pair mean difference, that is,
$D_k = Z_k D_k(1) + (1 - Z_k) D_k(0)$ where
$D_k(1) = \sum_{i=1}^{n_{1k}} Y_{i1k}(1)/n_{1k} - \sum_{i=1}^{n_{2k}} Y_{i2k}(0)/n_{2k}$ and
$D_k(0) = \sum_{i=1}^{n_{2k}} Y_{i2k}(1)/n_{2k} - \sum_{i=1}^{n_{1k}} Y_{i1k}(0)/n_{1k}$.
It is well known that under this model,
$\sum_{k=1}^m w_k D_k / \sum_{k'=1}^m w_{k'}$ is the uniformly
minimum variance unbiased estimator.

Although the derivation of this model is not discussed
in the cluster randomization literature, a model
commonly used in the statistics literature for other
purposes gives rise to these weights (see, e.g., Kalton,
1968): $Y_{ijk}(t) \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu_t, \tilde\sigma)$
for $t = 0, 1$, where $\tilde\sigma = \sigma \sum_{k=1}^m w_k$ and
$\sum_{k=1}^m w_k$ is a known constant since $w_k$ is
assumed fixed. The normality assumption is
not necessary for some inferential purposes, but this
model does require (1) independent and identical
distributions across units within each cluster as well
as (2) across clusters and pairs (which of course
implies constant means and variances within and
across clusters and pairs) and (3) equal variances
for the two potential outcomes. In sum, the model
assumes homogeneity within and across matched-pairs.
[Although we focus on the t-test here, for binary
outcomes the suggested approach in the literature
is also based on a homogeneity assumption
where the odds ratio is assumed constant across
clusters; see, e.g., Donner and Donald (1987);
Donner and Hauck (1989).]

Bias conditions. The harmonic mean weight differs
from our proposed weight in three ways. First,
it gives more weight to pairs with well-matched cluster
sizes than to pairs whose cluster sizes are unbalanced.
That is, if we assume the sum $n_{1k} + n_{2k}$ is
fixed, the harmonic mean is largest when $n_{1k} = n_{2k}$
and becomes smaller as $|n_{1k} - n_{2k}|$ increases. Second,
and most importantly, this weight does not
remove the bias when within-cluster average treatment
effects are identical within pairs (so long as
heterogeneity across matched pairs remains), meaning
that bias may remain even when matching is
effective. (The direction of the bias depends on the
data.) One condition under which it is unbiased is
exact matching on sample cluster sizes (i.e.,
$n_{1k} = n_{2k}$ for all $k$), in which case this estimator
coincides with our proposed estimator. Finally, since
the weight is based on sample cluster sizes, this
estimator is not valid for estimating CATE or PATE.
When its assumptions hold, the harmonic mean
estimator is uniformly minimum variance unbiased, and
is clearly useful in those circumstances.
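
The bias conditions can be illustrated with simple arithmetic. The sketch below assumes that each pair k has a common within-cluster average treatment effect tau_k and that each cluster in a pair is treated with probability 1/2, so the expected within-pair difference is tau_k. It also assumes harmonic weights of the form n1 * n2 / (n1 + n2) and, for comparison, weights proportional to the pair's total size n1 + n2. These forms and all numbers are illustrative assumptions, not values from the paper.

```python
def expected_estimate(taus, weights):
    """Expected value of sum(w_k * D_k) / sum(w_k) when E[D_k] = tau_k."""
    return sum(w * t for w, t in zip(weights, taus)) / sum(weights)


def harmonic(sizes):
    return [n1 * n2 / (n1 + n2) for n1, n2 in sizes]


def total_size(sizes):
    return [n1 + n2 for n1, n2 in sizes]


# For a fixed total of 20 units, the harmonic weight peaks at equal sizes.
best_n1 = max(range(1, 20), key=lambda n1: n1 * (20 - n1) / 20)

# Heterogeneous effects across pairs, unbalanced sizes in the second pair.
sizes = [(10, 10), (2, 18), (30, 30)]
taus = [1.0, 3.0, 5.0]

# Sample average treatment effect: unit-weighted average of the tau_k.
sate = expected_estimate(taus, total_size(sizes))
bias_harmonic = expected_estimate(taus, harmonic(sizes)) - sate

# With exactly matched cluster sizes the two sets of weights are
# proportional, so the two estimators have the same expectation.
matched = [(10, 10), (20, 20), (30, 30)]
gap_when_matched = (
    expected_estimate(taus, harmonic(matched))
    - expected_estimate(taus, total_size(matched))
)
```

In this hypothetical configuration the harmonic-weighted expectation differs from the sample average effect even though effects are identical within each pair, and the gap disappears once cluster sizes are matched exactly.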

Bias in the variance estimator. We show here that 
the variance estimator proposed in the literature 
(see, e.g., Donner, 1987; Donner and Donald, 1987; 
Donner and Klar, 1993) may be biased regardless 
of choice of weights and the direction of bias is in- 
determinate. A condition under which this variance 
estimator is unbiased (and approximately equal to 
ours) is when m is large and the weights are identi- 
cal across pairs, which is uncommon in practice. We 
first write this estimator using our notation: 



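(17) [Display equation defining $\hat\delta(w_k)$; not recoverable from the extracted text. Legible terms: a sum over $k = 1, \ldots, m$ involving $Z_k$, $(1 - Z_k)$, $\sum_{i=1}^{n_{1k}} Y_{i1k}/n_{1k}$ and $\sum_{i=1}^{n_{2k}} Y_{i2k}/n_{2k}$.]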



Next, we rewrite $\hat\delta(w_k)$ as

[Display equation not recoverable from the extracted text. Legible terms: $Z_k D_k(1) + (1 - Z_k) D_k(0)$; sums over $k'$ of $w_{k'}\{Z_{k'} D_{k'}(1) + (1 - Z_{k'}) D_{k'}(0)\}$; and $Z_k D_k(1)^2 + (1 - Z_k) D_k(0)^2$.]



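Taking the expectation with respect to $Z_k$, $E_a(\hat\delta(w_k))$,
gives

[Display equation not recoverable from the extracted text. Legible terms: a double sum $\sum_{k=1}^{m} \sum_{k' \ne k} w_k w_{k'} (D_k(1) + D_k(0)) (D_{k'}(1) + D_{k'}(0))$.]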



Comparing this expression with $E_a(\hat\sigma(w_k))$ in equation
(15) shows a difference which remains even after
taking the expectation with respect to simple random
sampling of pairs of clusters or units within
clusters. Since $\hat\sigma(w_k)$ is an approximately unbiased
estimate of the variance for UATE and PATE, $\hat\delta(w_k)$
may be biased.

A.5 Covariance Estimation

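This Appendix derives unbiased estimates of
$\operatorname{Cov}_{auc}(\hat\psi(w_k), \hat\tau(w_k))$ and $\operatorname{Cov}_{ac}(\hat\psi(w_k), \hat\tau(w_k))$ using
the proofs in Propositions 1-3. First, we derive
the true covariance between $\hat\psi(w_k)$ and $\hat\tau(w_k)$. Define
$G_k(1) \equiv \sum_{i=1}^{n_{1k}} R_{i1k}(1)/n_{1k} - \sum_{i=1}^{n_{2k}} R_{i2k}(0)/n_{2k}$
and
$G_k(0) \equiv \sum_{i=1}^{n_{2k}} R_{i2k}(1)/n_{2k} - \sum_{i=1}^{n_{1k}} R_{i1k}(0)/n_{1k}$.

Taking the expectation with respect to $Z_k$ yields
$\operatorname{Cov}_a(\hat\psi(w_k), \hat\tau(w_k)) = [\ldots] \sum_{k=1}^{m} w_k^2 (D_k(1) - D_k(0)) (G_k(1) - G_k(0))$,
where the leading constant is illegible in the source. Then, we have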

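$$
\begin{aligned}
\operatorname{Cov}_{ap}(\hat\psi(w_k), \hat\tau(w_k))
  &= E_p\{\operatorname{Cov}_a(\hat\psi(w_k), \hat\tau(w_k))\}
   + \operatorname{Cov}_p\{E_a(\hat\psi(w_k)), E_a(\hat\tau(w_k))\} \\
  &= \text{[coefficient illegible in source]} \times \operatorname{Cov}_p(\tilde D_k, \tilde G_k),
\end{aligned}
$$

where $\tilde G_k(t) = w_k G_k(t)$ for $t = 0, 1$, and the last equality
follows from the fact that $E_p(\tilde D_k) = E_p(\tilde D_k(t))$,
$E_p(\tilde G_k) = E_p(\tilde G_k(t))$ and $E_p(\tilde D_k \tilde G_k) = E_p(\tilde D_k(t) \tilde G_k(t))$
for $t = 0, 1$. Similarly,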

$\operatorname{Cov}_{au}(\hat\psi(w_k), \hat\tau(w_k))$